Edge Detection and Geometric Methods in
Computer Vision

A. Peter Blicher

Abstract 

Basic  problems  of  vision  are  studied  from  the  viewpoint  of  modern  differential  topology 
and  geometry;  primarily:  edge  detection,  stereo  matching,  picture  representation  at 
multiple scales, and motion. Some mathematical background is provided for the
non-expert.

A comprehensive review of edge detection is presented, including analyses from a
mathematical perspective as well as evaluations of practical performance and promise.

Some  new  edge  detection  techniques  are  introduced,  including  a  nonlinear  operator  based 
on  a  symmetry  principle,  a  variational  approach  to  global  edge  finding,  and  a  least-squares 
localisation  method.  A  theorem  is  proved  which  shows  that  localising  edge  position  and 
orientation  requires  at  least  2  orientation  dependent  families  of  convolution  operators. 

A unifying mathematical structure is presented for vision, notably stereo, motion stereo,
optic flow (apparent flow of visual space under motion), and matching. The general
matching problem is analysed, and it is proved that generically, general matching is
highly degenerate for monochrome pictures, but has a unique solution for 2 or more
color dimensions. The result is extended to pictures with unknown bias and gain. Smale
diagrams and level set topology are introduced as invariants useful for matching, reducing
the problem to graph or tree matching. The level set topology tree is also proposed
as a method of multi-scale description of the picture, and shown to be an invariant
generalisation of the “scale space” technique.

The  motion  problem  is  analysed  using  Lie  group  methods,  and  a  theorem  is  proved 
establishing that generically 6 simultaneous values of time derivative of the monochrome
picture  function  are  necessary  and  sufficient  to  uniquely  specify  the  3-dimensional  rigid 
motion  of  a  generic  given  object.  For  2  or  more  color  dimensions,  this  is  reduced  to  values 
at  3  points  in  the  picture. 
Preface


I  wrote  this  work  for  an  audience  of  both  vision  workers  and  mathematicians.  There  are 
many  people  who  could  be  counted  in  both  groups,  but  the  full  intended  audience  has 
a  wide  spectrum  of  backgrounds.  Among  vision  workers,  I  include  students  of  biological 
vision,  but  it  is  artificial  vision  that  I  am  directly  addressing.  It  is  widely  appreciated  that 
there are many common problems, but I haven’t written about any specifically biological
problems, such as explaining the function of some cell population. My hope is not only
to communicate research results, but to convince readers in each of these fields that
there is something of interest to them in the other. Many people have a bad taste from
previous experience with touters of fancy mathematics in these concrete situations, with
elixirs which turned out to be oversold or plain irrelevant. Mathematics is not magic;
using it doesn’t melt all impediments into triviality. It merely provides a structure for
understanding, and an apparatus for resolving questions, if the questions are the right
grist for the mill. Attracting these fields to each other isn’t necessarily easy.

Also, it poses a problem in writing; since I have tried to keep the material accessible to the
novice, some of it must necessarily be old hat to the expert, so I apologise to the expert
whom I have subjected to the obvious. I have tried to make this work reasonably
self-contained, including various standard definitions and results from differential topology
and geometry. When these are not in the main line of thought, they have been relegated
to fine print, so they can be easily skipped by those who already have the necessary
background. Sometimes, standard terms are used before they are defined, and sometimes
they are defined twice, partially from a lack of organisation, but primarily to locate
the mathematical digressions where they are most important, and avoid bogging things
down where they are not. I haven’t tried to be exhaustive in this, or I would have
been obliged to include a complete introduction to differential topology and geometry,
something  which  has  already  been  accomplished  with  great  skill  by  others.  The  chapter 
Geometric  Methods  in  Vision  makes  the  heaviest  use  of  abstract  mathematics;  therefore 
I  have  put  most  mathematical  background  material  into  the  fine  print  of  that  chapter. 
Since  I  have  assumed  some  of  that  background  material  in  earlier  chapters,  the  reader 
may find it useful to glance through it to clarify the unfamiliar, such as the implicit
function  theorem,  or  functional  notation. 

The  3  major  chapters  (A  Survey  of  Edge  Detection,  Contributions  to  Edge  Detection, 
and Geometric Methods in Vision) are largely independent, and can be read in any order,
or in isolation. The survey has many discussions which go beyond summarizing, and
should be of interest to readers who are already familiar with the literature, as well as to
newcomers.  The  contributions  chapter  is  probably  of  most  interest  to  specialists,  while 
the  geometry  chapter  is  likely  to  appeal  to  the  more  mathematically  inclined. 

Stanford,  California 
October, 1984
systems. The theory for nonlinear systems is not as well developed, and not so widely
known.  Who  can  say  that  there  is  not  some  other  isometry  that  is  more  appropriate  for 
a  nonlinear  system?  But  how  is  a  physiologist  or  psychologist  to  seek  evidence  for  such 
an  unknown  object? 
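The special property mentioned in this discussion, that the Fourier transform turns convolutions into multiplications, can be checked numerically. The following is only an illustrative sketch and not part of the original work: it uses a naive discrete Fourier transform and circular convolution on short sequences, and the names `dft`, `circular_convolve`, `f`, and `g` are hypothetical choices made for this example.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a sequence (O(n^2), fine for tiny n)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]

def circular_convolve(f, g):
    """Circular (periodic) convolution of two equal-length sequences."""
    n = len(f)
    return [sum(f[k] * g[(m - k) % n] for k in range(n)) for m in range(n)]

# Arbitrary test data: a signal and a small smoothing kernel, both of length 8.
f = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
g = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

# Transform of the convolution versus pointwise product of the transforms.
lhs = dft(circular_convolve(f, g))
rhs = [a * b for a, b in zip(dft(f), dft(g))]

# The two agree up to floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(lhs, rhs))
print("largest discrepancy:", max_err)
```

The identity relies on the convolution being a linear, shift-invariant operation, which is why the property loses much of its force for nonlinear systems.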

Finally,  a  third  motivation  for  edge  finding  is  based  on  the  computational  consideration 
of  efficiency.  Since  boundaries  are  of  a  smaller  dimension  than  images  or  regions,  they 
are easier to handle: e.g. (for a smooth boundary, rather than a fractal) if the number of
boundary points increases as O(n), with 1/n the discretization interval (grid size), then
the number of region points increases as O(n^2). Also, the 1-dimensionality of a boundary
provides  a  natural  ordering  for  its  points,  which  is  easily  translated  to  a  processing  order 
for  a  sequential  algorithm.  While  the  entirety  of  an  image  is  filled  with  region  points, 
only  a  small  fraction  constitute  edge  points. 
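The O(n) versus O(n^2) scaling can be checked numerically. The following is a minimal sketch, not taken from this work: it assumes a disk of radius 0.4 (in units of the picture width) sampled on an n by n grid, and the function name `count_disk_points` and the 4-neighbour definition of a boundary point are illustrative choices.

```python
def count_disk_points(n):
    """Count region and boundary grid points of a disk sampled on an n x n grid.

    A grid point is a region point if it lies inside the disk; it is also a
    boundary point if at least one of its 4 nearest neighbours lies outside.
    """
    radius_sq = (0.4 * n) ** 2
    center = n / 2.0

    def inside(i, j):
        return (i - center) ** 2 + (j - center) ** 2 <= radius_sq

    region = 0
    boundary = 0
    for i in range(n):
        for j in range(n):
            if not inside(i, j):
                continue
            region += 1
            neighbours = ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1))
            if any(not inside(a, b) for a, b in neighbours):
                boundary += 1
    return region, boundary

# Halving the grid interval (doubling n) roughly quadruples the region count
# but only roughly doubles the boundary count.
for n in (64, 128, 256):
    region, boundary = count_disk_points(n)
    print(n, region, boundary)
```

For a fractal boundary the boundary count would grow faster than linearly, which is why the text restricts the claim to smooth boundaries.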

The  sparsity  of  edge  points  among  image  points  is  a  major  attraction  of  edge  detection 
as  an  early  step  in  stereoscopic  vision,  since  it  extremely  diminishes  the  size  of  the  search 
involved  in  matching  points  between  the  two  images.  Of  course,  this  is  only  useful  if  the 
edge  points  bear  some  relation  to  things  rigidly  attached  to  fixtures  in  the  world,  so  as  to 
vindicate  the  assumption  that  edge  points  must  match  edge  points.  As  it  happens,  this 
seems  in  fact  to  be  a  good  assumption,  and  edge  points  appear  to  be  more  stable  than 
more  rudimentary  features  of  intensity,  such  as  the  actual  brightness  values. 

So  far  we  have  given  a  very  general  definition  of  edge  detection  as  finding  the  geographical 
limits of a description. It is probably fair to say that few if any authors of edge detection
methods thought that was what they were doing. The universal goal of edge detection
algorithms is to find places in the image which a human would classify as “edges” or
“boundaries.” We apologize if that seems a trivial statement, but since we do not know
how a person segments a scene, we are in no position to give an authoritative definition of
what  constitutes  an  “edge.”  Everyone  agrees  that  a  transversely  translated  step  function 
In more familiar terms, the description is nothing more than a model of the data in
some  system  of  representation  of  the  pertinent  knowledge.  Including  boundaries  or 
regions  per  se  in  the  representation  language  makes  vacuous  the  notion  of  homogeneity. 
Our  observations  above  are  merely  rewordings  of  the  thesis  that  humans  have  a  very 
rich  representation  language  available,  while  machines  as  yet  do  not.  Incidentally,  the 
term  representation  language  is  meant  to  refer  to  the  internal  representation,  whether 
it  be  essentially  symbolic,  continuous,  or  whatever,  and  shouldn’t  be  confused  with  the 
language we use to communicate about the aspects of the representation introspectively
available  to  us. 

The problem of finding homogeneous regions can be approached either by finding the
regions directly, whose difficulty increases with the complexity of description, or by
finding  the  boundaries  between  the  regions.  Thus  the  first  motivation  for  edge  finding  is 
for  the  purpose  of  segmentation  into  homogeneous  regions  in  accordance  with  our  own 
models  of  the  world.  It  is  based  on  introspective  observations. 

A  second  motivation  is  derived  from  extrospective  observations  which  have  been  made 
by  physiologists,  perception  psychologists  and  perceptual  psychophysicists.  Anatomical 
structures  have  been  found  which  respond  to  abrupt  changes  in  intensity  and  color  as 
functions  both  of  position  and  time.  Observations  have  been  made  that  indicate  absolute 
colors  and  intensities,  as  well  as  their  slow  changes  are  not  readily  perceived,  but  abrupt 
changes  are.  Whether  it  is  wise  to  mimic  nature,  or  rather  to  attempt  to  mimic  the 
precious little we think we know about nature, is problematic, despite the widespread tacit
acceptance of the idea. It is worth considering that we are unlikely to find physiological
processes involved with things that are not already a part of our introspective models. For
example, we look for evidence that the visual system performs Fourier transforms, since
Fourier transforms have a particular intuitive appeal. But they can also be viewed as only
one of a myriad of possible isometries of a function space, special because they transform
convolutions  to  multiplications.  But  that  special  property  is  mainly  significant  for  linear 
way to describe such a heterogeneous object is still to partition it into homogeneous
components. The tendency is always to subdivide, perhaps reflecting the reductionist ethos
widespread  in  modern  science.  One  might  expect  that  in  artificial  intelligence  jargon,  this 
process would be called the chotomizing heuristic; in fact, it is called segmentation. (Or
borrowing  from  Caesar,  subdivide  and  conquer.) 

The products of the subdivision are homogeneous entities. For a human, the homogeneity
is  one  of  description,  while  for  the  machine  it  is  generally  one  of  measurement.  Now,  a 
measurement  is  a  description,  and  a  set  of  descriptions  is  a  description,  so  we  have  to 
explain what we mean by these terms a little more precisely. By a description, we mean
something which might be quite complicated from a machine perspective, encompassing
such explicit descriptions as “it gets darker and redder from right to left, with a speckling
that looks like that on a trout, but which fades into a very dense network of lines in the
periphery.” That would not generally be found to be a homogeneous region by a machine.
A measurement, on the other hand, is meant to connote something very close to the
language of the transducer providing the image data, e.g. brightness or range values.
Most  of  the  definitions  of  homogeneity  implicit  in  automatic  segmentation  programs 
stray  little  from  a  constancy  of  such  a  measurement,  though  the  situation  appears  to  be 
improving. 

By demanding a homogeneous description to define a homogeneous region, we mean that
the description can have no explicit mention of boundaries or constituent regions. The
relatively weak condition of explicitness is right because an implicit boundary would
not be a property of the description, but rather something inferred from it. Thus a
description such as “the value goes linearly from 100 in the lower left to -100 in the
upper right” would be judged homogeneous, despite the well defined diagonal boundary
separating positive from negative values. For the time being, we are content to include
as homogeneous such descriptions as “the intensity goes as a step function...”
Introduction

Why  edge  detection? 

Edge  detection,  in  the  form  of  spatial  differentiation,  appears  in  the  computer  vision 
literature as early as 1955 [Dinneen 1955]. This early and sustained interest arises from
a perception that the types of patterns significant to a visual system consist of
approximately homogeneous regions separated by abrupt boundaries. Although years of
experience have shown that real digitized scenes are not easily characterized this way,
the  idea  has  persisted  tenaciously,  for  the  following  reasons. 

First,  from  an  introspective  point  of  view,  one  tends  to  believe  the  world  to  be  composed  of 
objects,  each  homogeneous  in  its  cohesion,  and  abruptly  separated  from  other  objects  and 
the  background.  That  is  an  essential  aspect  of  our  way  of  perceiving  the  world,  pervading 
disciplines from anatomy (where every bump, nodule, fascia, and tissue type is seen
as a separate structure) to quantum mechanics (in which an essentially continuous
all-pervasive field is seen to describe a separate localized particle). Whether this discretization
of the world is a part of the structure of the world or of ourselves we cannot say (arguably
it is impossible to say); nevertheless, it is here to stay.

Now  close  inspection  of  digital  images,  or  for  that  matter,  paintings  from  past  centuries, 
leaves  little  doubt  that  the  image  of  a  single  (realistic  looking)  object  can  but  rarely  be 
described  as  “homogeneous.”  Yet,  even  upon  making  such  an  observation,  one’s  natural 


Some mathematical background which is assumed in this chapter, such as functional notation and some
results from differential topology, is explained in more detail in the fine print of the chapter Geometric
Methods in Vision.
in understanding how the geometry and topology interact, and I began by studying the
consequences of topology. This led to results which intertwine color vision with stereo,
and  which  clarify  the  role  of  geometric  constraints  in  monochrome  stereo  vision. 

The  main  invariant  structure  in  these  studies  was  the  family  of  level  sets;  each  level  set 
is  the  set  of  points  in  the  picture  that  all  have  the  same  brightness  or  color.  Differential 
topology  has  studied  this  structure  thoroughly,  so  we  were  able  to  say  some  things  about 
how  the  level  sets  fill  up  the  picture,  and  what  happens  when  the  picture  is  perturbed. 
This  forms  a  departure  point  for  characterizing  the  picture  function  which  is  independent 
of  viewpoint.  It  also  is  related  to  smoothing  the  picture,  and  some  other  operations 
which  people  have  applied  to  find  features  in  pictures,  e.g.  zero  crossings  of  Laplacians 
of  Gaussians. 
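As an illustration of the kind of feature operation mentioned above, the sketch below finds zero crossings of a smoothed second derivative in one dimension, the 1-dimensional analogue of zero crossings of a Laplacian of a Gaussian. It is a simplified example under stated assumptions rather than a method from this work: the signal is a 1-dimensional step, the Gaussian is truncated at 3 standard deviations, border samples are replicated, and all function names are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius):
    """Sampled, normalized Gaussian with the given standard deviation."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_same(signal, kernel):
    """Convolve, replicating the border samples so the output keeps its length."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - r, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def zero_crossings_of_second_derivative(signal, sigma=2.0):
    """Return indices i such that the smoothed second difference changes sign
    between samples i and i + 1."""
    smooth = convolve_same(signal, gaussian_kernel(sigma, int(3 * sigma)))
    # d2[k] is the second difference centered on sample k + 1.
    d2 = [smooth[i - 1] - 2.0 * smooth[i] + smooth[i + 1]
          for i in range(1, len(smooth) - 1)]
    crossings = []
    for k in range(len(d2) - 1):
        if d2[k] * d2[k + 1] < 0.0:
            crossings.append(k + 1)
    return crossings

# A step from 0 to 1 between samples 19 and 20.
step = [0.0] * 20 + [1.0] * 20
edges = zero_crossings_of_second_derivative(step)
print(edges)
```

In two dimensions the second difference would be replaced by a discrete Laplacian, and the zero crossings would form curves rather than isolated points.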

The next step was to study the effects of the geometry. In the spirit of modern geometry,
I approached this by studying the action of a Lie group, i.e. by looking abstractly at
the effect of rigid motion. This type of problem is usually easier in its differential form,
i.e. for infinitesimal motions, so that was the best place to start. This put us into the
business of studying motion rather than discrete views. A family of basic questions is
how much can be deduced about the motion from how much data. The particular one
of these questions that I studied was based on the idea that the raw data consists only
of brightness or color values at fixed places on the retina, along with data on how they
are changing with time. This is somewhat different than the approach many have taken
in the past, where the goal has either been to find the 3-dimensional motion from the
motion of points on the retina, or else to find that motion of individual retinal points
itself. We were able to show that generally 6 data points of our kind are enough to
specify the motion of a given surface. Again, this was related to color. The number 6 is
for monochrome data; for color data, 3 points are enough.
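A heuristic parameter count may help make the numbers 6 and 3 plausible. It is only a sketch and not the proof given in this work: it assumes that the surface is known, that each measured time derivative yields one linear equation per color dimension in the infinitesimal motion, and that the equations are generically independent. The symbols $\omega$, $\tau$, $I_c$, and $a_{i,c}$ are notation introduced here for illustration.

```latex
% An infinitesimal rigid motion is an element of the Lie algebra se(3):
% a rotation rate \omega and a translation rate \tau, each in R^3.
\[
  v(X) = \omega \times X + \tau, \qquad
  (\omega, \tau) \in \mathbb{R}^3 \times \mathbb{R}^3, \qquad
  \dim \mathfrak{se}(3) = 6 .
\]
% Assumed form of one measurement at retinal point x_i in color dimension c:
% a linear constraint on the six unknown motion parameters.
\[
  \frac{\partial I_c}{\partial t}(x_i) = a_{i,c} \cdot (\omega, \tau), \qquad
  a_{i,c} \in \mathbb{R}^6 .
\]
% Counting equations against the 6 unknowns:
\[
  \text{monochrome: } 6 \text{ points} \times 1 = 6, \qquad
  \text{two color dimensions: } 3 \text{ points} \times 2 = 6 .
\]
```

The count only shows that 6 and 3 are the smallest numbers of points that could suffice under these assumptions; that they are in fact generically sufficient is the content of the theorem.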

The  2nd  main  topic  of  this  thesis,  geometry  applied  to  vision,  thus  comprises  some 
suited to scrutiny under the macroscope of modern geometry. The “real” objects of vision
are objects embedded in 3-dimensional space. They are subjected to various lighting
conditions,  and  viewed  in  a  variety  of  ways.  It  is  the  properties  of  the  real,  solid  objects 
that  we  must  deduce  from  the  image;  so  we  must  study  how  the  properties  express 
themselves,  and  come  to  understand  their  invariances,  especially  invariances  of  qualitative 
features.  I  use  “qualitative”  in  the  sense  of  “qualitative  dynamics,”  where  understanding 
comes  from  topological  descriptions,  still  quite  rigorous,  rather  than  from  some  formula 
which  allows  us  to  calculate  the  precise  numerical  values  describing  the  state  of  a  system. 
E.g., we would rather know where (and whether) an object’s shape changes (say from
convex to concave), than know the terms of a polynomial that specifies that shape.

The  route  of  study  then  became  to  approach  the  specific,  e.g.  edge  finding,  by  first 
understanding  the  overall  problem.  My  first  step  was  simply  to  write  down  what  the 
spaces,  maps,  and  groups  were  that  were  involved.  This  provided  a  structure  in  which  to 
apply  formal  results  of  mathematics. 

A basic fact of life in seeing 3-dimensional objects on a 2-dimensional retina is that as
we change our viewpoint, or as the objects move, their 2-dimensional images undergo
distortions. Understanding 3 dimensions from this 2-dimensional world must involve
recognising an object despite these distortions, and, what’s more, interpreting the
distortions to deduce the shape of the object and its relative motion. The exact distortion
that  the  picture  undergoes  depends  on  the  shape  of  the  object,  the  motions  of  object  and 
observer,  and  the  optics  that  produce  the  picture. 

Stereoscopic  vision  conventionally  starts  with  2  pictures  from  different  views  and  requires 
finding places in the 2 pictures that correspond to a single place in the 3-dimensional
scene. When this is done for enough places in the pictures, it allows triangulation to
find depth (i.e. the 3rd dimension). A basic problem is to study the invariants which
allow  such  matching  to  take  place.  Since  the  distortions  can  be  complex,  I  was  interested 




rifts,  ridges,  dips,  etc.  are,  and  how  they  interlock.  Then,  once  the  image  is  “understood” 
this  way,  maybe  to  the  point  of  hypothesizing  objects,  some  regions  may  take  on  special 
importance as “edges.” In this view, while 1-dimensional objects (edges) are important
for representing what’s in the image, they are a result, not a first step, of
understanding. This is a somewhat heretical point of view, and it is by no means certain. But I
became  convinced  that  the  understanding  of  local  image  features,  e.g.  labelling  some 
features  as  edges,  depended  on  getting  a  qualitative  global  understanding  of  the  image. 
When  I  say  “global”  here,  I  don’t  mean  that  one  has  to  understand  the  whole  area  of 
the picture, but rather a large enough area that the most local measurements can be put
into a context. For example, from a single view through a tiny peephole one might say
something about which way the shading is changing, but it takes a larger field of view
to say something about the 3-dimensional object involved: whether we are looking at a
convexity  or  concavity,  a  fold,  an  edge,  an  uninteresting  shading  gradient,  or  some  other 
solid  feature. 

The  enterprise  of  computer  vision  seeks  to  duplicate  a  feat  we  know  from  introspection, 
but to duplicate it by cold mathematical means. There are many styles of research, but in
my thinking, this enterprise is most likely to succeed if the mathematical setting and the
questions being posed are stated explicitly and precisely. Often, in fact, finding the right
way to state a problem turns out to be a cornerstone of the solution. I had come to see the
problem as one of describing the image appropriately on increasingly global scales, and
piecing together the descriptions to arrive at local interpretations. This makes it essential
to find the right ways to understand the picture function for the goal of understanding
3-dimensional structure. Differential topology and geometry seemed to be the right places
to  look. 

Vision  is  teeming  with  geometry:  the  image  comes  from  a  map  from  a  higher  dimensional 
space with important singularities, there are natural group actions, e.g. the rigid motion
group, topological invariance in the image is important, etc. These are things perfectly
through a very small peephole.

Since  the  very  beginnings,  researchers  who  have  built  vision  systems  which  did  some 
higher  level  recognition  tasks  have  been  sorely  limited  by  the  abilities  of  the  lower  level 
algorithms  they  used  for  input.  This  was  a  problem  for  ACRONYM  [Brooks  1981]  no  less 
than  for  [Waltz  1972].  Waltz  created  a  system  which  was  able  to  understand  line  drawings 
of  toy  blocks.  The  presumption  was  that  low-level  vision  could  supply  the  line  drawings. 
But  it  had  turned  out  to  be  very  difficult  to  obtain  line  drawings  good  enough  to  use 
as  input,  even  in  the  blocks  world  domain  of  uniform  matte  surfaces  in  good  lighting. 
This problem had already spawned the efforts of Binford-Horn [Horn 1972] and then,
later, of [Shirai 1973]. 10 years later, [Brooks 1981] used the state-of-the-art line finder
of [Nevatia and Babu 1978], and found that he had to draw inferences based on almost
laughably  meager  low-level  output.  Today,  reliable  segmentation  (dividing  an  image  into 
meaningful  parts)  remains  a  paramount  obstacle  to  image  understanding. 

Hence I was drawn to edge detection as a basic problem which might yield to a
mathematical approach. I found that people had applied a great patchwork of techniques,
but that the problem itself was very poorly understood. Edges, it seems, are a lot like
obscenity, for as Mr. Justice Potter Stewart wrote of obscenity [Jacobellis v. Ohio 1964],
he may not be able to define it, “But I know it when I see it.” Everyone agrees that a
perfect step function should give an edge, but there has been no adequate criterion put
forth to classify any other function as edge or non-edge. There was no viable theory
to bridge the gap between the local methods of the peephole and the global objects we
think of as edges, if indeed the global must even first come from the local. I eventually
realized that the problem of edge detection was first a problem of understanding the
image intensity function, a qualitative understanding which must be suited to the needs
of vision. In fact, I wonder if edge detection is a bit of a red herring. The best, sharpest
edges are easy enough to find, all right, but it seems that the global picture may require
knowing  how  all  the  local  qualitative  features  of  the  image  fit  together:  where  the  bumps, 
anywhere within range.

Someone  who  has  never  tried  to  write  a  vision  program  is  likely  to  think  that  it  really 
can’t  be  all  that  hard.  The  act  of  vision  is  so  effortless  for  us,  so  transparent,  that  it 
is hard at first to imagine that anything at all must be done to bring it off, beyond
the initial transduction. You digitize what’s out there; the objects are delineated by
their borders, which are the places where things change suddenly, so you look for the
places of rapid change, figure out what the objects are, and voilà! all done. In fact,
a lot of research was based on that paradigm. The illusion of simplicity seduced many
people into thinking that it was a programming problem like many others, which could
be solved by doing some intuitively obvious things, followed by bug-fixing, honing, and
tuning. Unfortunately, life is not so easy.

One  of  the  hardest  things  to  appreciate,  even  to  describe,  is  what  it  means  to  know 
what  is  in  an  image.  In  some  sense,  the  set  of  all  the  pixel*  values  already  has  all  the 
information about the image. But there is no knowledge that, i.e. no symbolic knowledge.
You can’t know about the relation between any 2 pixels unless those pixels talk to each
other somehow. From another perspective, knowing all the pixel values is no better than
knowing them one at a time: that is completely local information, but what we really need is
global information. And the global information needed must be exactly that information
that lets us draw inferences about the physical situation that produced the image. Global
information is very hard to obtain because a picture contains a lot of data (around 256K
bytes† for a normal TV frame, and on the order of 100,000K bytes for the human retina;
for comparison, a page of text in a book is around 1K byte), and the space of patterns to
consider is of high dimension. Most people have approached this complexity problem by
trying  to  extract  information  for  very  small  regions,  from  a  few  to  a  few  hundred  pixels, 
and  to  use  this  information  for  only  a  few  such  regions  at  a  time.  This  is  like  looking 

* Pixel is a contraction of picture element, a single point of data in the picture.
† We take a byte to consist of 8 bits, where a bit is a single binary digit.
very cleverly; but isn’t the act of the programmer, the act of creating the symbolic data
base,  a  fundamental  act  of  abstraction  and  hence  of  intelligence?  These  linguistic  entities 
tend  to  be  like  things  that  people  would  say,  compact  statements,  but  how  much  of  a 
thought  or  worldly  experience  can  be  captured  this  way?  So  one  must  worry  about  what 
knowledge  is,  what  knowledge  is  needed,  and  how  to  represent  it.  What  is  often  neglected 
is  the  question  of  how  that  knowledge  can  come  to  be  known.  One  way  is  to  hire  a  team 
of  knowledge  engineers  who  spend  months  or  years  codifying  the  knowledge  of  an  isolated 
domain  into  a  formal  system  accessible  to  the  machine.  In  the  long  run  for  AI,  this  has  to 
be  a  losing  proposition:  how  long  would  it  take  you  to  write  down  everything  you  know? 
And  that  only  counts  the  things  that  you  know  you  know  and  you  know  how  to  express. 
There’s  no  substitute  for  experience. 

Experience  is  the  only  possible  way  to  amass  a  data  base  that  can  be  said  to  have  “world” 
knowledge.  Experience  must  be  abstracted,  perhaps  in  many  stages  and  many  ways,  to 
yield  the  data  structures  used  by  the  higher  processes,  perhaps  abstracted  even  to  yield 
the very processes. The rope of mind has 2 ends: what do I need to know to be able to
reason,  and  what  can  I  say  about  what’s  happening;  and  it  has  to  be  spliced  somewhere 
in  the  middle  to  connect  the  outside  world  with  the  inner  one.  Perception  must  be 
able  to  produce  the  data  structures  required  for  reasoning.  In  fact,  given  our  meager 
understanding  of  intelligence,  we  can’t  really  draw  a  line  between  perception  and  reason. 
Maybe  there  is  none.  After  all,  the  relatively  “minor”  ability  of  perception  has  so  far 
proven vexingly intractable.

Among  our  senses,  vision  is  probably  the  richest  and  most  important.  Only  vision  and 
hearing  have  well-developed  transducer  technologies,  making  them  readily  accessible  to 
attack by computer. The problems of hearing, particularly speech understanding, are
no  less  than  those  of  vision,  but  I  happen  to  be  more  visually  than  aurally  oriented,  and 
vision  has  more  obvious  connection  to  geometry  and  topology,  so  it  was  vision  that  I 
found  myself  working  in.  Also,  there  was  a  vision  group  at  Stanford,  but  no  speech  effort 
This work is mainly about 2 topics in computer vision:

•  Edge  detection 

•  Applications of geometrical methods.

The “geometrical methods” are those of modern differential topology and geometry.

I  came  to  do  this  research  in  pursuit  of  the  eventual  goal  of  understanding  and  building 
intelligence. 

An intelligent creature, whether of flesh or metal, must be able to know what is going
on around it, and do something about it. Those are the peripheral functions: perception
and action. These are certainly necessary, but aren’t they rather minor in comparison
to the “higher” functions involved in thinking, feeling, learning, language, etc.? This is
an interesting question, but it isn’t just this simple necessity of perception that led me
to its study.

A great deal of artificial intelligence (AI) research studies the higher functions, and with
varying degrees of success, tries to duplicate them. I find a curious thread running through
much of this work: the manipulation of linguistic entities. People have long said that the
main thrust of AI is symbol manipulation, and indeed it seems that to be smart you should
be able to transform data into abstractions, and abstractions are symbols, which in some
sense are linguistic entities, abstractly at least. The linguistic entities of AI tend to be
statements with a great deal of meaning to the programmer, such as (DUCK IS-A BIRD),
but the machine hasn’t the least interest in what the symbols stand for in the world.
The AI program endows the machine with a means to manipulate these symbols, perhaps

ought to be called an edge. This corresponds to a boundary between regions of uniform
intensity  measurement,  uniform  at  least  near  the  boundary.  Very  little  attention  has  been 
paid  to  any  other  definition  of  “edge,”  despite  the  fact  that  close  observation  of  images 
reveals  that  step  edges  between  uniform  (strip)  regions  are  exceedingly  rare.  This  is  not 
to  say  that  edge  detectors  built  to  detect  step  edges  don’t  find  real  edges;  indeed  they 
often  do,  and  indeed  they  often  make  grievous  errors. 

The  term  “edge”  has  been  fairly  widely  abused,  and  we  will  continue  that  tradition  here. 
One type of edge is that resulting from the boundary of some object. There are also
edges which are merely boundaries between surface features. There are local edges and
global edges, which are frequently called contours. Local and global are relative terms,
and  we  mean  them  in  comparison  either  to  image  or  grid  size.  A  local  part  of  a  curve,  for 
example,  would  be  well  approximated  by  a  straight  segment  in  the  given  grid  size.  Thus 
another  way  of  looking  at  the  difference  between  local  and  global  is  related  to  manageable 
and  unmanageable  search  problems,  since  locally  all  possible  curves  can  be  represented  as 
all  possible  line  segments  on  a  coarse  grid,  while  globally  the  space  of  all  possible  curves 
is  vast. 

Local  edge  detection 

We will not attempt to give a mathematically precise as well as operationally general
definition of “edge” here. Properly, to do so one would study the imaging process as
well as real images. [Herskovits and Binford 1970] did so to a limited extent, presenting
essentially 1-dimensional results. Essentially, what people have been looking for as edges
are places with a large gradient, or places which resemble a step function in cross-section.
So-called “roof” edges, modelled as a discontinuity in 1st derivative, have been sought
as well. It turns out that a number of different outlooks on how to look for these
features lead to essentially the same computational technique, viz. convolution with some
kernel followed by thresholding. (Strictly speaking, it is usually cross-correlation which
is implemented, but since the families of kernels involved are complete under inversions,
we take liberties with the term “convolution.”)

Spatial  differentiation  and  gradient  estimation 

If edges are places where things change fast, then the obvious way to look for them is
by performing a spatial differentiation. This may be done by some discrete analog of the
gradient, which is implemented by convolving with a kernel of small support. The smallest
possible support for a differentiation is 2 pixels, and in such a case the convolution is often
thought of as taking adjacent pixel differences, or first differences. Larger supports allow
more creativity in the choice of the convolution kernel defining the differentiation, and
provide the benefit of improved noise behavior. A great many authors estimate gradient
or “stepness” by computing adjacent pixel differences. [Martelli 1972, Martelli 1973] and
[Turner 1974] are examples of the latter. Another way to think of the gradient is as a
derived parameter of fitting a plane to the data. For sufficiently symmetric supports,
this can also be implemented as convolutions. In fact many outwardly sophisticated
techniques have as their core the estimation of gradient.
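
As a rough illustration of the trade-off between the 2-pixel support and larger ones, the following sketch cross-correlates a 1-dimensional intensity profile with two kernels. The 4-pixel kernel (a difference of two 2-pixel averages) and the sample profiles are hypothetical choices made only for this example; they are not taken from any of the cited works.

```python
def correlate_1d(signal, kernel):
    """Cross-correlate a 1-D signal with a kernel at fully overlapping positions."""
    n, k = len(signal), len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(n - k + 1)]

# Smallest possible support (2 pixels): the adjacent pixel difference.
FIRST_DIFFERENCE = [-1, 1]
# Hypothetical wider kernel: difference of two 2-pixel averages.
SMOOTHED_DIFFERENCE = [-0.5, -0.5, 0.5, 0.5]

step = [10, 10, 10, 10, 50, 50, 50, 50]   # an ideal step of height 40
spike = [0, 0, 0, 8, 0, 0, 0]             # an isolated noisy pixel of height 8

for name, profile in (("step", step), ("spike", spike)):
    narrow = correlate_1d(profile, FIRST_DIFFERENCE)
    wide = correlate_1d(profile, SMOOTHED_DIFFERENCE)
    # Compare the largest response magnitude of each kernel.
    print(name, max(abs(v) for v in narrow), max(abs(v) for v in wide))
```

With these particular inputs both kernels give the same peak response to the step, while the wider kernel halves the peak response to the isolated pixel. The price is that the step response is spread over more positions.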

Template  matching  and  matched  filtering 

A popular way to look for features is with a matched filter or template, and this is
quite common for step edges. Again the cross-correlation with the template, or the space
domain realization of the filter, are implemented as convolutions. The idea is that the
“template” (the convolution kernel) is an ideal case of the feature one is seeking, and
one looks for large values of the correlation as indicating the presence of the feature.
The term “template-matching” often suggests that the vector space projection analysis
of the process is at best a secondary consideration. Examples are the operators of Sobel
[Duda and Hart 1973] and [Kirsch 1971], as well as many others (further examples can
be found in [Abdou 1978] and [Rosenfeld and Kak 1976]). The matched filter approach
is operationally the same, but includes the analytical idea that as a consequence of the
Cauchy-Schwarz inequality, the maximum response for normalised data occurs when
the data is the (complex conjugate of the) template. [Duda and Hart 1973] provide a
more detailed discussion of the ideas of spatial differentiation, gradient estimation, and
template matching, with a slightly different viewpoint. [Shanmugam, Dickey, Green 1979]
seek a slight generalization of the matched filter, in the sense that the filter must be
strictly bandlimited and the objective is to maximize the power of the step response in a
given space interval.
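
The Cauchy-Schwarz argument can be checked numerically. The sketch below is a minimal illustration for real-valued data (so no complex conjugate is needed). The 3x3 vertical step template and the helper names `unit` and `match_score` are assumptions introduced for the example.

```python
import math

def unit(v):
    """Scale a vector to unit Euclidean length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def match_score(patch, template):
    """Inner product of the normalised patch with the normalised template."""
    return sum(p * t for p, t in zip(unit(patch), unit(template)))

# Hypothetical 3x3 vertical step template, flattened row by row.
STEP_TEMPLATE = [-1, 0, 1, -1, 0, 1, -1, 0, 1]

scaled_copy = [-2, 0, 2, -2, 0, 2, -2, 0, 2]      # a positive multiple of the template
horizontal_step = [-1, -1, -1, 0, 0, 0, 1, 1, 1]  # the same feature rotated by 90 degrees

print(match_score(scaled_copy, STEP_TEMPLATE))      # the largest possible score, up to rounding
print(match_score(horizontal_step, STEP_TEMPLATE))  # orthogonal to the template
```

By Cauchy-Schwarz the score cannot exceed 1 in magnitude, and it reaches 1 only for positive multiples of the template. This is also why a separate template is needed for each orientation: the rotated step scores 0 here.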

Locally, i.e. at a single point of the convolution result, the integration against the kernel
can be thought of as orthogonal projection onto a 1-dimensional subspace of $R^n$, where
n is the number of pixels in the support of the kernel, and the projection is with respect
to the usual inner product on $R^n$. If there is more than 1 subspace involved, i.e. more
than one convolution, then one has components which can be thought of as components
of a vector in the space spanned by the subspaces. Then one can compute a magnitude
for that vector (so as to get a number representing “edgeness” for thresholding). The
magnitude may be in the Euclidean norm

$\|v\| = \left( \sum_i v_i^2 \right)^{1/2}$

or  in  some  other  norm,  such  as  the  max  norm 

$\|v\| = \max_i |v_i|$,

or  the  sum  norm 

$\|v\| = \sum_i |v_i|$.
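
For concreteness, here is a small sketch of the three magnitudes applied to a vector of convolution responses. The function names and the sample response pair are illustrative assumptions.

```python
import math

def euclidean_norm(v):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def max_norm(v):
    """Largest absolute component."""
    return max(abs(x) for x in v)

def sum_norm(v):
    """Sum of absolute components."""
    return sum(abs(x) for x in v)

# Hypothetical responses of two convolutions (say, x- and y-difference kernels) at one pixel.
responses = [3.0, -4.0]
print(euclidean_norm(responses), max_norm(responses), sum_norm(responses))
```

For any response vector the max norm is at most the Euclidean norm, which is at most the sum norm. A threshold tuned for one norm therefore does not transfer unchanged to another.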

Best  edge  fit  and  optimal  estimation 

The simplest edge model, a translated step function, has 3 parameters (for a 2-dimensional
picture). These might be, e.g., angle, left height, and right height. With enough
normalisation, these can be reduced to the single parameter of angle. Template matching
methods use a separate template for each angle considered. But one can also try to
determine the angle that best accounts for the data. Furthermore, the model may have
more parameters, and there may be statistical information available.

The simplest type of best fit problem occurs when the model space is a linear subspace
of the data space, which is an inner product space. In that case, the best fit is obtained
by orthogonal projection to the model space. This is a very common method for fitting
functions in 1 dimension, based on the observation that translation in space is equivalent
to multiplication by a complex-valued function of frequency in the frequency domain, so
that all the translates of a given frequency component make up a linear subspace. In 2
dimensions, though, matters are complicated by the presence of rotations, so that while
the same artifice applies to translations, the Fourier equivalent of rotation is still rotation,
and the set of all rotations of a component is no longer a (1-dimensional) linear subspace,
so direct orthogonal projection is no longer applicable. Hence many methods which seem
very clever for 1 dimension fail for 2 dimensions. However, this nice property of Fourier
transforms for 1 dimension can be thought of as a special case of a more general principle,
which may be of use in inventing best fit methods. Specifically, one way to restate the
spectral theorem [Halmos 1957, Halmos 1953] is that any normal operator in a Hilbert
space is unitarily equivalent to a multiplication. For our purposes the Hilbert space can
be taken to be $L^2(R^2)$. Then the spectral theorem can be interpreted to say that given
a normal operator A, we can find some isometry $U : L^2 \to L^2$ and some function $\varphi \in L^\infty$
such that $U^{-1} A U (f) = \varphi f$ for all f simultaneously. If A is a translation operator, the
Fourier transform is such a unitary transformation, as we mentioned above. According to
the theorem, there is some isometry of $L^2(R^2)$ which will transform rotation into complex
multiplication. Using that isometry like a Fourier transform, one could use projection
methods to find best fits. Even better would be a transform that worked for translations
and rotations at the same time, but that is impossible because translations do not in
general commute with rotations (as would clearly be necessary for the existence of such
a transform because multiplication is commutative).
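
The statement that translation becomes multiplication by a function of frequency can be checked on a discrete stand-in. The sketch below assumes a periodic 1-dimensional signal and uses the discrete Fourier transform in place of the continuous one. The function names and the sample signal are illustrative only.

```python
import cmath

def dft(x):
    """Plain O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def cyclic_shift(x, s):
    """Translate a periodic signal by s samples: y[t] = x[(t - s) mod n]."""
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]

def shift_multiplier(n, s):
    """The function of frequency that translation by s turns into after the DFT."""
    return [cmath.exp(-2j * cmath.pi * k * s / n) for k in range(n)]

signal = [1, 2, 0, -1, 3, 0, 0, 5]
shift = 3

transformed_shift = dft(cyclic_shift(signal, shift))
multiplied = [m * c for m, c in zip(shift_multiplier(len(signal), shift), dft(signal))]

# The two sides agree up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(transformed_shift, multiplied)))
```

No analogous pointwise multiplier exists for rotations under the ordinary Fourier transform, which is the obstruction described above.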

A slightly more general best fit problem occurs when the model set consists of an
n-parameter family of functions, and the object is to find a member of the family which
minimises some error measure with respect to the datum. If the family is differentiable,
it can be thought of as a submanifold of the ambient space. Frequently the error measure
is a metric on the space, and then the problem is seen as one of finding the closest point
of the model manifold to the datum. In the case of estimation, there is a probability
distribution involved, and one seeks a set of parameters minimizing the expected error.

[Altes 1975], [Hueckel 1971, Hueckel 1960], [O’Gorman 1976], [Abdou 1978] find best fit
edges. Altes uses essentially the 1-dimensional Fourier method described above. Hueckel
and O’Gorman minimize the distance between the projection of data and parametrized
model onto a truncated orthonormal basis, deriving the “optimal” parameters. However,
both the number of parameters and the number of terms in the series are too small to
allow good performance. Altes uses a more realistic edge model (in 1 dimension), but his
results are not readily generalized to 2 dimensions. Abdou finds the best fit edge by what
is essentially an exhaustive search over a slightly more general but still too simple model
space, namely linear ramps between constants.

When the parameters one is seeking are the coefficients in an orthonormal basis, the
parameters can be obtained simply by taking the inner product with the basis elements.
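
A minimal sketch of this remark follows, using a 4-sample Walsh-type orthonormal basis chosen purely for illustration; it is not the basis of any cited author. The coefficients are inner products, the full expansion reproduces the data, and a truncated expansion is the orthogonal projection onto the span of the retained elements.

```python
# Hypothetical orthonormal basis of R^4 (Walsh-type vectors scaled by 1/2).
BASIS = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]

def inner(u, v):
    """Usual inner product on R^n."""
    return sum(a * b for a, b in zip(u, v))

def coefficients(data, basis=BASIS):
    """Expansion coefficients: one inner product per basis element."""
    return [inner(data, b) for b in basis]

def reconstruct(coeffs, basis=BASIS):
    """Sum of coefficient times basis element; pass fewer coefficients to truncate."""
    n = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(n)]

data = [1, 3, 5, 7]
coeffs = coefficients(data)
print(coeffs)                   # all four coefficients
print(reconstruct(coeffs))      # full expansion recovers the data
print(reconstruct(coeffs[:2]))  # truncated expansion: projection onto the first two elements
```

The truncated reconstruction shows concretely what is lost when the series is cut short, which is the weakness noted above for expansions with too few terms.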

Higher  order  derivatives 

Methods that rely on estimates of the gradient, or whose response is largely determined
by the gradient, cannot distinguish smooth transitions from abrupt ones. In the Hueckel
and O’Gorman approaches, for example, the early term(s) in the expansions are essentially
the gradient. One approach to this problem is to use a preprocessing step which takes
linear functions to 0. This idea is advanced by [Binford 1981] in the form of “lateral
inhibition,” and in fact operators modelled on second and higher order derivatives will
have this property. (It’s interesting to consider just how many such operators there
are. Suppose the operator support is n pixels. There are then n linearly independent
linear operators on that support. Requiring that all operators take constants to 0 is a linear constraint,
and that they take all linear functions to 0 is 2 more linear constraints, so there are
n − 3 linearly independent operators fulfilling the constraints. One may impose further
constraints by requiring various symmetries, and each discrete symmetry will reduce the
dimension of the operator space by 1. For large supports, it is clear that there are many
candidate operators.) The second derivative in the calculus of several variables is the
Hessian, which is a matrix. Its algebraic invariants are the geometric invariants of the
original function viewed as a surface. Various combinations of its components (taken
linearly and nonlinearly) can be used as 2nd derivative operators. If an edge is sought
at suitably defined maxima of the gradient, then for a 2nd derivative operator, one seeks
zero-crossings. [Marr and Hildreth 1979] use an approximation to the Laplacian, which is
the trace of the Hessian. [Dreschler and Nagel 1981a, Dreschler and Nagel 1981b] use the
determinant of the Hessian. [Beaudet 1978] computes rotationally invariant derivatives
up to 4th order. [Canny 1988] takes an optimal estimation approach to the zero-crossing
of 2nd derivative problem of [Marr and Hildreth 1979], using criteria of detectability and
localisation in a variational formulation.
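
A 1-dimensional sketch of these two properties follows, using the familiar second difference kernel [1, -2, 1] as a stand-in for a 2nd derivative operator. The sample profiles and helper names are assumptions made for the example. The sketch checks that the kernel takes a linear ramp to 0 and that its zero-crossing falls where the first difference of a blurred step is largest.

```python
def correlate_1d(signal, kernel):
    """Cross-correlate a 1-D signal with a kernel at fully overlapping positions."""
    n, k = len(signal), len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(n - k + 1)]

SECOND_DIFFERENCE = [1, -2, 1]

def zero_crossings(response):
    """Indices i where the response strictly changes sign between i and i + 1.

    Exact zeros are not treated as crossings in this simplified version.
    """
    return [i for i in range(len(response) - 1)
            if response[i] * response[i + 1] < 0]

ramp = [2, 5, 8, 11, 14]                # a linear function: no edge
blurred_step = [0, 0, 1, 4, 9, 12, 13, 13]

print(correlate_1d(ramp, SECOND_DIFFERENCE))          # linear functions are taken to 0
response = correlate_1d(blurred_step, SECOND_DIFFERENCE)
print(response, zero_crossings(response))
```

Output index i of the second difference is centred on sample i + 1, so a sign change between outputs 2 and 3 lies between samples 3 and 4, which is where this profile rises fastest (from 4 to 9).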

Approximation  and  representation  of  image  function 

One of the drawbacks of the methods we have been describing is that a very few
parameters are derived by some kind of local projection. The parameters are chosen
for semantic interest, but while they respond well to intended features, the same is often
true for unintended features. We have the following situation. Let X be the space of all
local images, and $F \subseteq X$ the features one is seeking. Perhaps this is done by some map
$\varphi : X \to R$. One designs this map so that $\varphi(F) > \theta$, for some threshold $\theta$, and one
would like to be able to infer that if $\varphi(x) > \theta$, then $x \in F$. Clearly, to do this, one must
have some information about $\varphi^{-1}(-\infty, \theta)$, but this is surprisingly often neglected.

Another  way  to  think  of  this  is  that  the  few  semantically  derived  parameters  actually 
do  not  provide  enough  information  to  understand  the  structure  of  the  image  intensity 
function,  even  locally.  Now,  of  course  the  pixel  values  constitute  complete  information, 
but  it  is  not  directly  usable.  One  approach,  then,  is  to  seek  a  local  representation  for 
the  image  data  which  is  appropriate  for  the  questions  one  wishes  to  resolve  with  further 
processing. Approximating the Hessian is such a process, since that can be regarded as
finding the best local quadratic approximation, just as computing a gradient can be viewed
as finding a planar approximation. [Prewitt 1970] computed her gradient parameters
based on a planar fit. In the same vein, [Haralick 1980] fits planes to the data and defines
edges as boundaries between maximal domains of fit, relative to an error measure. Planar
fitting is very crude, so he [Haralick 1981] proposes polynomial fitting as an extension.
[Beaudet 1978] is motivated by fitting a truncated Taylor series, though the semantics
he ascribes to his operators are somewhat naive. [Hsu, Mundy, Beaudet 1978] use a
quadratic fit, based on Beaudet’s techniques. [Altes 1975] is put forward as essentially a
spline fit.

Global  edge  detection 

The Hough transform [Hough 1962], [Duda and Hart 1971, Duda and Hart 1972, Duda
and Hart 1973] is a technique to find collinear sets of feature points over an entire image.
This can be applied in complete globality, i.e. over the entire image at once. [Ballard and
Sklansky 1976], [Shapiro 1974, Shapiro 1975, Shapiro 1978], and others use generalisations
of the method to look for other 1-dimensional objects.
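
The voting idea behind the Hough transform can be sketched in a few lines. The version below uses the normal parametrisation rho = x cos(theta) + y sin(theta) with a uniformly quantised accumulator. The bin sizes, function names, and sample points are assumptions made for illustration, and no claim is made that this matches any cited implementation in detail.

```python
import math
from collections import Counter

def hough_votes(points, n_theta=180, rho_step=1.0):
    """Accumulate votes in (theta index, rho index) bins for every feature point."""
    votes = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(k, round(rho / rho_step))] += 1
    return votes

def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (theta, rho, vote count) for the fullest accumulator bin."""
    votes = hough_votes(points, n_theta, rho_step)
    (k, r), count = votes.most_common(1)[0]
    return math.pi * k / n_theta, r * rho_step, count

# Ten collinear feature points on the line y = 5, plus three stray points.
points = [(x, 5) for x in range(10)] + [(2, 1), (7, 9), (3, 3)]
theta, rho, count = strongest_line(points)
print(theta, rho, count)
```

Because of the quantisation, several neighbouring theta bins can collect all ten collinear points, so the reported angle is only accurate to within a few bins. This is one practical face of the accumulator resolution trade-off.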

Frequently, the term linking is used synonymously with global edge detection. Linking
consists of making lists of local edge elements connected head to toe, each list corresponding
to an extended (global) edge. This is the most common global edge detection method,
dating back at least to [Roberts 1963], and including many others (e.g. [Horn 1972],
[Binford 1970], [Nevatia and Babu 1978]). These methods differ primarily in the predicates
used to determine whether to join a particular edgel into a contour. A major
difficulty stems from the fact that the linking proceeds only after irreversible decisions
are made about local edges, e.g. limiting each pixel to having an edgel of unique orientation,
or making a binary decision about the presence of a local edge. The type of
information available to the linking, which generally proceeds locally, is inadequate for
many situations.

An improvement on the linking method is advanced by [Montanari 1970, Montanari 1971]
and [Martelli 1972, Martelli 1973]. Here the prior local commitment is less extreme, and
dynamic  programming  or  heuristic  graph  search  methods  are  used  to  find  optimal  paths 
with  respect  to  some  figure  of  merit.  The  figure  of  merit,  a  global  parameter,  replaces  the 
local  predicate  as  the  contour  selection  method,  and  likewise  as  the  main  artistic  element. 
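
To make the contrast with purely local linking concrete, here is a small dynamic programming sketch. It assumes a grid of local edge strengths, a contour that crosses the grid one column at a time while moving at most one row per step, and a figure of merit equal to the summed strength along the path. These are simplifying assumptions for illustration, not the formulation of the cited authors.

```python
def best_contour(strength):
    """Return (total merit, row index per column) of the best left-to-right path.

    strength[r][c] is a local edge strength; the path may move at most one
    row up or down between adjacent columns.
    """
    rows, cols = len(strength), len(strength[0])
    merit = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        merit[r][0] = strength[r][0]
    for c in range(1, cols):
        for r in range(rows):
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < rows]
            prev = max(candidates, key=lambda p: merit[p][c - 1])
            merit[r][c] = merit[prev][c - 1] + strength[r][c]
            back[r][c] = prev
    end = max(range(rows), key=lambda r: merit[r][cols - 1])
    path = [end]
    for c in range(cols - 1, 0, -1):
        path.append(back[path[-1]][c])   # follow the stored predecessors backwards
    path.reverse()
    return merit[end][cols - 1], path

grid = [
    [1, 0, 0, 9],
    [5, 6, 1, 0],
    [0, 1, 7, 2],
]
print(best_contour(grid))
```

In this grid a tracker that always steps to the strongest neighbouring edgel would follow 5, 6, 7, 2 for a total of 20, whereas the path of best overall merit gives up the 7 in order to reach the 9.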

The “relaxation” methods propounded by [Zucker, Hummel, Rosenfeld 1977] and
[Rosenfeld, Hummel, Zucker 1975] attempt to find the contours globally, in parallel,
and without excessive initial commitment. The process depends on a local pairwise
reinforcement-inhibition process between edgels. The art is in choosing the reinforcement
process. Explicit global edges are not produced, but presumably the process terminates
with sets of edge points which are both connected and of a desired minimum length,
which are then readily identified.

Region  growing 

We  motivated  edge  detection  as  a  means  to  region  finding.  Why  not  just  find  the  regions 
directly?  Many  people  have  tried  doing  just  that.  The  advantage  is  that  one  is  dealing 
with  a  global  object,  so  the  problem  of  linking  is  (or  seems  to  be)  avoided.  Rather  than 
deciding  whether  an  edge  separates  2  points,  one  must  decide  whether  2  points  belong  to 
the same region. Seen thus, the difference is mainly one of (linguistic) semantics. The
data structures reflect regions, not edges, as do the algorithms. Consequently, despite the
conceptual  equivalence  with  edge  finding,  different  approaches,  harder  to  express  in  the 
edge  detection  paradigm,  are  developed.  The  simplest  method  is  based  on  segmentation 
simply by intensity or color value. [Brice and Fennema 1970, Fennema and Brice 1970]
take  this  as  their  starting  point,  and  then  try  heuristics  to  clean  up.  [Ohlander  1975] 
segments  based  on  dividing  bimodal  histograms  of  several  color  parameters.  [Shafer  1980] 
builds  on  Ohlander’s  work.  [Somerville  and  Mundy  1976]  use  a  technique  based  on  more 
sophisticated  reasoning.  They  grow  regions  based  on  the  uniformity  of  an  approximation 
to  the  normal  to  the  image  intensity  function.  [Kirsch  1971]  defines  regions  based  on 
thresholding a “contrast” (gradient) function.
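
The simplest method mentioned above, segmentation by intensity value, can be sketched as a flood fill. The version below grows a 4-connected region from a seed pixel, accepting neighbours whose value lies within a tolerance of the seed value. The tolerance rule, function name, and tiny test image are assumptions made for illustration and do not reproduce any of the cited systems.

```python
from collections import deque

def grow_region(image, seed, tolerance):
    """Return the set of 4-connected pixels within `tolerance` of the seed value."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if (inside and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region

image = [
    [10, 11, 50],
    [12, 10, 52],
    [51, 49, 50],
]
dark = grow_region(image, (0, 0), tolerance=3)
bright = grow_region(image, (2, 2), tolerance=3)
print(sorted(dark))
print(sorted(bright))
```

Note that the decision made for each neighbour, whether it belongs with the seed, is the mirror image of deciding whether an edge separates the two pixels, which is the equivalence pointed out above.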

In the following, I have attempted to provide a critical guide to the literature in
segmentation. The list of works reported on is by no means exhaustive, but it is intended
to include the most influential works as well as some others representative of the field.
In addition to summarizing each work, I have usually tried to put it into some perspective,
which is to say that I have included many of my own reflections. I hope that the
boundaries between the two are discernible enough.

General Works

Prewitt  1970 

“Object Enhancement and Extraction”

The  paper  is  concerned  with  the  entire  image  understanding  problem: 

•  image formation

•  image  restoration 

•  enhancement  (including  edge  enhancement) 

•  object  extraction 

The author provides a fairly extensive bibliography (237 references) of literature at that
time (much of which is still germane).

The work is fairly sophisticated mathematically. E.g., Prewitt considers the Laplace,
Mellin, Fourier, and Hankel transforms, moments, Haar-Walsh functions (cf. [O’Gorman
1976]), Chebyshev polynomials, point spread function (PSF), line spread function (LSF),
edge spread function (ESF), modulation transfer function (MTF), and phase transfer
function (PTF). She also discusses resolving power and restoration, including
“super-resolution” for restoring images which have been degraded by a convolution (referencing
e.g. [Slepian and Pollak 1961, Landau and Pollak 1961, Landau and Pollak 1962] and
applications).

Edge  enhancement 

A  section  devoted  to  edge  enhancement  discusses  the  gradient,  generalized  derivative, 
Laplacian,  and  discrete  approximations  to  gradient. 


As one means of obtaining an estimate of the gradient, she introduces the 3×3 (now
so-called) “Prewitt operator”:

     1  0 -1        1  1  1
     1  0 -1        0  0  0
     1  0 -1       -1 -1 -1

This is used in a method of estimating the gradient by fitting a quadratic surface to data
on a 3×3 square. The masks give ∂/∂x, ∂/∂y for that surface directly from the data.
This is exactly the method used by [Haralick 1980] for facets. Similarly, one can use the
same idea for a 4×4 fit or a Laplacian.
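
The connection between the masks and surface fitting can be verified directly. The sketch below compares the mask responses with the slopes of a least-squares plane z = a x + b y + c over the 3×3 square. On this symmetric grid the linear coefficients of a quadratic fit coincide with those of the plane fit, because x and y are orthogonal to the constant and quadratic terms. The coordinate convention (x to the right, y downward, masks applied as cross-correlation) is an assumption, and it is what fixes the sign and the factor of 6 below.

```python
PREWITT_X = [[1, 0, -1],
             [1, 0, -1],
             [1, 0, -1]]
PREWITT_Y = [[1, 1, 1],
             [0, 0, 0],
             [-1, -1, -1]]

def mask_response(patch, mask):
    """Cross-correlate a 3x3 patch with a 3x3 mask (sum of elementwise products)."""
    return sum(mask[r][c] * patch[r][c] for r in range(3) for c in range(3))

def least_squares_slopes(patch):
    """Slopes (a, b) of the least-squares fit z = a*x + b*y + c on the 3x3 grid,
    with x = column - 1 and y = row - 1, so both coordinates run over -1, 0, 1."""
    sxz = sum((c - 1) * patch[r][c] for r in range(3) for c in range(3))
    syz = sum((r - 1) * patch[r][c] for r in range(3) for c in range(3))
    return sxz / 6, syz / 6   # the sum of x^2 (and of y^2) over the grid is 6

# An exactly planar patch z = 2x + 3y + 7.
patch = [[2 * (c - 1) + 3 * (r - 1) + 7 for c in range(3)] for r in range(3)]

a, b = least_squares_slopes(patch)
print(a, b)
print(mask_response(patch, PREWITT_X), mask_response(patch, PREWITT_Y))
```

Under the stated conventions each mask response equals −6 times the corresponding fitted slope, so the masks are the fitted gradient up to a fixed scale and sign.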

She  also  discusses  oriented  edge  masks,  e.g. 

     1  1  1
     1 -2  1
    -1 -1 -1

as approximations to the gradient (“compass gradient”), and gives some examples of their
use. 

A discussion of modified “crispening” (Laplacian) operators is presented, as well as of line
enhancers (which are basically templates, i.e. matched filters).

Frequency  filtering 

Low,  high,  and  band  pass  filtering  is  considered. 

She  discusses  templates,  matched  filtering,  and  cross  correlation  for  feature  detection. 

A good discussion of thresholding is presented.

The paper is an excellent overall survey of the then-existing methods for feature extraction,
and in particular edge detection. By and large, the intervening years have seen only
minor improvements, so the analysis she presents is still relevant today.

Evaluation

Aside  from  frequency  domain  filtering,  the  methods  presented,  including  the  “Prewitt” 
operator,  are  completely  local  with  small  support — in  her  words,  “context-insensitive.” 
Consequently,  global  structures  cannot  contribute  to  the  edge  finding  process,  and  the 
derived  image  description  is  limited  to  1  or  2  local  parameters  which  provide  inadequate 
description  of  the  image  intensity  function  for  all  but  especially  regular  images. 

Unlike  most  gradient  estimation  or  template  matching  operators,  the  Prewitt  operator 
is based on a well-defined process, namely the best fit of a plane. The gradient by itself is not
sufficient for edge detection, since no discrimination is made between smooth and abrupt
transitions, although plane fits can be used in more sophisticated ways (see e.g. [Haralick
1980]).

Davis 1973

“A Survey of Edge Detection Techniques”

The  author  presents  some  discussions  of  prior  edge  detection  techniques: 

Parallel  edge  detection 

Herskovits and Binford 1970

linear vs. nonlinear operators (nonlinear: mainly Rosenfeld, Hummel, Zucker 1975)
texture edges

Griffith 1970, Griffith 1973a, Griffith 1973b
Hueckel 1971, Hueckel 1973
Chow and Kaneko 1972

Sequential edge detection
Martelli 1972, Montanari 1970

“Guided” (top-down) edge detection
Kelly 1971
Harlow and Eisenbeis 1973
Shirai 1973

He  discusses  and  criticizes  what  was  done  very  tersely.  There  are  no  particularly  deep  or 
sophisticated  analyses;  nevertheless  this  work  provides  a  useful  first  tour  or  refresher  of 
some  of  the  more  significant  work  in  the  field.  One  can  detect  a  subtle  and  not  surprising 
bias toward Rosenfeldism.

Kanade  1978 

“Region Segmentation: Signal vs. Semantics”

A survey of image segmentation is presented, based on the paradigm: Image → Picture
Domain Cues → Scene Domain Cues → Model → Instantiated Model → View Sketch →
Image → ..., which may be iterated. A distinction is made among the categories of signal,
physical,  and  semantic  knowledge. 

A large number of works are briefly surveyed, and categorized according to how they fit
into  the  above  paradigm.  For  example,  many  methods  use  only  signal  level  knowledge, 
and  hence,  in  this  paradigm,  can  provide  at  most  a  segmentation  based  on  picture  domain 
cues. 

Evaluation 

The paradigm presented can be more conventionally summarized as saying that one’s
goal must be to infer the 3-dimensional structure of a scene in order to model the scene
and understand the image. Furthermore, one must use physical knowledge, e.g. imaging
physics and geometry, to make this inference. This is hardly new or controversial. What
is debatable, however, is the distinction which is made between picture and scene domain
cues. The main orientation of the paper is toward region growing and splitting methods,
using fairly primitive “signal level knowledge,” e.g. histograms of the image gray values.
For these types of systems, the image-picture-scene-model division is clear and seems
natural. But for “image understanding” in general (which the author is addressing), such
an easy description does not seem justified, and no arguments are presented to persuade
the reader, though in the author’s defense it must be said that there were severe space
limitations for a fairly broad article.

It  seems  reasonable  that  the  first  step  in  image  understanding  might  well  be  to  compute 
a  description  of  the  image  data  in  a  more  useful  representation,  or  set  of  representations, 
than is provided by the standard one, i.e. the set of pixel values. Kanade notes, in fact,
that [Pavlidis 1972] defines segmentation as a process for describing the image features
themselves. From this point of view, “picture cues” are features of this re-representation.
(Kanade takes a more restrictive and ill-defined view; he defines “picture cues” by the
examples: line segments, homogeneous regions, and intensity gradient. The last of these
is properly a property of the image, but it can be argued that the first two generally
cannot be extracted reliably without using knowledge about 3-dimensional structures, and
that is tantamount to making inferences about the “scene domain,” although admittedly,
historically such inferences are implicit.) But it is not so obvious that there must be a
trichotomy: picture-scene-model. First, the new image representation is chosen based
on physical knowledge, the knowledge that determines what it is important to look for.
Whatever features are focused on in analyzing the new image representation are likely to
be interpretable as features in the scene domain only in conjunction with fitting them into
a  model.  For  example,  the  interpretation  of  a  narrow  gradient-shaded  region  may  depend 
on  its  connection  to  other  regions  and  on  some  set  of  hypotheses  about  other  regions  in 
the  vicinity.  This  might  even  be  on  the  level  of  deciding  whether  the  region  is  an  object 
limb,  a  surface,  a  highlight,  or  even  whether  it  should  be  regarded  as  a  separate  region  at 
all.  One  can  readily  envision  a  Waltz  or  Zuckcr  type  relaxation  process  occurring  using 
the  semantic  relations  of  a  model  to  interpret  part  of  an  image  representation  as  a  scene 
which expansions cannot avoid truncation error. However, if the sampling kernel is taken
to be (say) $C^\infty$, its linear combinations (i.e. the functions $\sum_a c_a f_a$, where the $f_a$ are discrete
translations of the sampling kernel) become prime candidates for an expansion series.
These can be frequency (or sequency) ordered. Expansion in terms of such functions has
been extensively studied; use of truncated series fitting is worth investigating.

Turner 1974

"Computer Perception of Curved Objects Using a Television Camera"

We are concerned here only with the edge finding aspects of this work.

The author gives a brief critical synopsis of earlier line finding work:

Binford-Horn [Horn 1972]

Griffith 1970

Herskovits and Binford 1970
Hough 1962

Hueckel 1971, Hueckel 1973
Kelly 1971
Murphy 1960

O'Gorman and Clowes 1976
Pingle 1966

Pingle and Tenenbaum 1971
Roberts 1963
Shirai 1973
Tenenbaum 1970

The edge finder Turner employs is very simple, using the first difference of adjacent pixels,
followed by thinning, and further by a local tracker (inchworm).

A short review of curve segmentation is provided.
assumes that the digitization process takes place by averaging over a square pixel sized
window, i.e.

$$g_{ij} = \iint f \, P_{ij} \, dA,$$

where

$f$ = image irradiance

$P_{ij}$ = unit 2-dimensional pulse at the point $(i,j)$

$g_{ij}$ = the sampled output,

then the $P_{ij}$ constitute an orthonormal set whose span is identical to that of the Walsh functions
of order less than $I \cdot J$ (where $I, J$ are the cardinalities of the $i, j$ sets). The higher
order Walsh functions describe exactly only what goes on within pixels, which is precisely
the information lost in the digitization process, so one has a perfect match of model
to data. The Walsh basis differs from the single pixel basis most notably because the
support is spread over the entire region of interest, i.e. the Walsh basis has global support.
Truncating the series therefore results in global degradation, rather than local as would
be the case with the analogous action of leaving off some set of pixels.
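
The contrast between global and local degradation can be illustrated with a minimal 1-dimensional sketch. It assumes NumPy, builds the Walsh functions from a Sylvester-type Hadamard matrix (natural rather than sequency order, which is an assumption of this sketch and not O'Gorman's ordering), and compares dropping the last two Walsh terms with dropping the last two pixels of a sampled step edge. The names `hadamard` and `truncate` are illustrative only.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
W = hadamard(n) / np.sqrt(n)   # rows: orthonormal 1-D Walsh functions on n pixels
P = np.eye(n)                  # rows: orthonormal single-pixel pulses

# A 1-D step edge sampled on n pixels.
g = np.where(np.arange(n) >= 3, 1.0, 0.0)

def truncate(basis, signal, keep):
    """Expand `signal` in an orthonormal basis and keep only the first `keep` terms."""
    coeffs = basis @ signal
    coeffs[keep:] = 0.0
    return basis.T @ coeffs

err_walsh = np.abs(truncate(W, g, 6) - g)   # error after dropping 2 Walsh terms
err_pixel = np.abs(truncate(P, g, 6) - g)   # error after dropping 2 pixels

print("Walsh truncation error:", err_walsh)
print("Pixel truncation error:", err_pixel)
```

In this sketch the pixel truncation leaves the first six samples untouched, whereas the Walsh truncation perturbs samples throughout the window, including samples in the flat region away from the edge.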

Unfortunately, incorporating sufficient Walsh terms to utilize all the picture data is
equivalent to doing a fit of a perfect edge to the sampled data with the pixel value =
average intensity assumption. This becomes extremely complex as the number of pixels
increases, and if lateral displacements of edges are permitted, since the discontinuous pulse
convolution kernel forces independent examination of numerous cases corresponding to
the edge configurations' relations to corners of pixels. O'Gorman already has to consider
2 such cases for a 4 × 4 operator and 8 Walsh functions. As the space grows larger, so does
the complexity, so that [Abdou 1978] chooses to do an exhaustive search as his method
of fit.

The advantage of everywhere differentiable functions (such as [Hueckel 1971, Hueckel
1973] uses) is that the lack of discontinuity permits a single set of equations to express
the optimization problem. Of course, if one assumes a discontinuous sampling kernel,
the point spread function, one would have to preserve this property while generalising
the integration properties, which is not at all trivial. What comes to mind is using tensor
products of the 1-dimensional functions, which could be expected to have similar
properties, but that would not lead to a simple description of boundaries.

The method of locating the free knots is not very clearly presented, and appears to be
based on a possible misconception. In 2 dimensions the problem is of course more difficult
because of the complexity of the boundary space. There is no obvious way to solve this
problem.

The paper is more biology oriented than computer oriented, so understandably no
consideration is given to digital processing issues, the most important of which is the effect
of discrete sampling on a periodic grid.

O'Gorman 1976

"Edge Detection using Walsh Functions"

O'Gorman shows that finding edge direction by fitting a plane and then taking its gradient
direction is subject to systematic error for perfect step edges centered in a square window.
However, this is a consequence of the shape of the window; a circular window would not
have the same problem. Nevertheless, the analysis is salient because pictures are sampled
on a square grid and rectangular operators are common.
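
The systematic error can be reproduced numerically. The following is a minimal sketch, assuming NumPy, an 8 × 8 window, and area sampling approximated by averaging a fine grid of sub-samples; the function name `plane_fit_edge_angle` and its parameters are illustrative and are not taken from O'Gorman's paper. It fits a least squares plane to an ideal step edge through the window centre and returns the direction of the plane's gradient.

```python
import numpy as np

def plane_fit_edge_angle(theta_deg, n_pix=8, sub=32):
    """Gradient direction (degrees) of the least squares plane fitted to an
    ideal step edge with normal direction theta_deg, centred in a square window."""
    m = n_pix * sub
    c = (np.arange(m) + 0.5) / m * 2.0 - 1.0           # fine sample coordinates in [-1, 1]
    x, y = np.meshgrid(c, c)
    t = np.deg2rad(theta_deg)
    fine = (x * np.cos(t) + y * np.sin(t) > 0).astype(float)
    # Approximate area sampling: each pixel is the mean of its sub x sub block.
    g = fine.reshape(n_pix, sub, n_pix, sub).mean(axis=(1, 3))
    pc = (np.arange(n_pix) + 0.5) / n_pix * 2.0 - 1.0   # pixel centre coordinates
    X, Y = np.meshgrid(pc, pc)
    A = np.column_stack([np.ones(g.size), X.ravel(), Y.ravel()])
    coef = np.linalg.lstsq(A, g.ravel(), rcond=None)[0]
    return float(np.rad2deg(np.arctan2(coef[2], coef[1])))

for theta in (0.0, 22.5, 45.0):
    est = plane_fit_edge_angle(theta)
    print(f"true {theta:5.1f} deg, plane fit {est:6.2f} deg, error {est - theta:+.2f} deg")
```

By symmetry the estimate is essentially exact at 0° and 45°, while intermediate orientations are biased toward the nearer window axis. For the continuous square window, direct integration gives an estimated tangent of $2t/(3 - t^2)$ for a true tangent $t$ with $|t| \le 1$, which is consistent with what the sketch shows.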

He uses the 2-dimensional Walsh functions (tensor products of square waves) as an
orthonormal basis for representing the image function. In analogy to [Hueckel 1971,
Hueckel 1973] he does an $L^2$ (least squares) fit of a perfect edge on the first 6 terms (in
his Walsh expansion).

The contribution of this idea derives from the fact that the Walsh basis bears a simple
relationship to the digitisation process (if one assumes square pixels). In particular, if one
It is interesting to compare these with the difference of Gaussians suggested by Marr and
Hildreth based on similar assumptions.

Some comparisons with human vision are made, notably between line spread functions,
which are shown to be similar.

Evaluation

The work is quite provoking. The most interesting features are that it incorporates
the transducer transfer function, models images as intrinsically discontinuous objects
in a coherent way, and uses statistical estimation for detection. Unfortunately, the
generalization to 2 dimensions is not easy, and probably not as easy as Altes seems to
suggest. He proposes 2 routes of "generalization." The more straightforward involves
using rasters at a number of angles. Though this is not as satisfying as an intrinsically
2-dimensional approach, it may be a viable way to proceed. Significant problems that
would have to be overcome include integrating all the information from the various scan
lines (which could be argued to be 99% of the problem to begin with), and accounting
for or using a 2-dimensional transducer transfer function. Making a true generalization
to 2 dimensions poses the following difficulties. Knots are of codimension 1; i.e. they are
boundaries between regions, so on a space of 1 dimension, a knot is 0-dimensional, or
a point. But on a space of 2 dimensions, a knot is the boundary of a region, i.e. some
curve, a 1-dimensional object. So for 1 dimension, the space of knots is 1-dimensional
(since it is the space of points), but for 2 dimensions the space of boundaries is infinite
dimensional (since it is a space of curves). The approach of workers in spline theory has
been to generalize the intervals between knots to projections from higher dimensional
simplices, leading in the 2-dimensional case to piecewise straight boundaries, but this
seems to be inadequate for a natural description of the boundary. The 2-dimensional
analog of the delta function at a knot is a delta function whose support is a boundary.
Since the main virtue of using the expansion is the simplicity of convolution with
where $u^{(-n)}$ is defined analogously to $\delta^{(-n)}$, since $u * \delta^{(-n)} = u^{(-n)}$.

Working  in  the  Fourier  domain,  he  introduces  a  derived  set  of  normalized  basis  functions, 
and  shows  how  to  estimate  the  coefficients  in  the  expansion  in  the  case  of  a  single  knot  of 
known  position.  For  multiple  free  knots,  he  proposes  using  techniques  from  detection  and 
estimation  theory,  based  on  a  statistical  model  of  knot  location.  However,  the  approach 
is  predicated  on  the  use  of  a  matched  filter  to  locate  the  knots,  which  appears  to  be 
doomed  to  failure  because  the  basis  is  not  orthogonal. 

The core of the author's method uses filters to estimate coefficients or detect complex
patterns. Based on filter complexity considerations, he argues that these filters should
all have approximately equal space-bandwidth products. These arguments are related
to implementation issues, and for the digital case would be related to cost. One must
keep in mind, however, that a major consideration of the work is a theory of human
vision. In order to achieve a set of filters with the desired property, he seeks a set
of transducer transfer functions to incorporate into the imaging transfer function $U$.
Although it is not stated in the paper, one can think of this as a convolution preprocessor
which allows further processing to be done by filters all having the same space-bandwidth
product. He uses one particular way of obtaining a constant space-bandwidth product,
viz. $F_n(\omega) = \alpha_n F_{n-1}(k\omega)$ for all $n$ with a fixed constant $k > 1$, where $\alpha_n$ is an arbitrary
proportionality constant and


$$F_n(\omega) = \frac{U(\omega)\,(j\omega)^{-n}}{\|U(\omega)\,(j\omega)^{-n}\|},$$


where $\|\cdot\|$ signifies the $L^2$ norm. Although this is a simple way to get a constant
space-bandwidth product, it is not the only way: e.g., a different $k$ could be used for each $n$.
In any case, using this assumption, he arrives at a set of log-normal transducer transfer
functions, i.e. functions of the form


$U(\omega) =$
the color space. From this point of view, 3) is not really possible (unless one wants to
be ad hoc), and 2) is unsophisticated (it may be adequate in many cases, but won't
maximize S/N). It might be computationally efficient to choose not a basis, but a larger
spanning set. 1) therefore is the way to go. Note that there may be more than one metric
which is worth using simultaneously. It is also worth investigating the differences between
using a metric such as

$$d(p, q) = \|p - q\| = \sqrt{\sum_i (p_i - q_i)^2}$$

and using a function

$$\rho(p, q) = \big|\, \|p\| - \|q\| \,\big| = \left| \sqrt{\sum_i p_i^2} - \sqrt{\sum_i q_i^2} \right|.$$

Notice the latter is like the intensity difference.
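
The difference between the two functions is easy to see on a pair of colors with equal norm. The following is a minimal sketch assuming NumPy and colors given as 3-component vectors; the names `d`, `rho`, `red` and `green` are illustrative.

```python
import numpy as np

def d(p, q):
    """Euclidean metric on the color space."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def rho(p, q):
    """Difference of norms; behaves like an intensity difference."""
    return float(abs(np.linalg.norm(p) - np.linalg.norm(q)))

red, green = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(d(red, green))    # nonzero: the two colors are distinct points
print(rho(red, green))  # zero: the two colors have the same norm
```

Here $\rho$ vanishes although the colors differ, so $\rho$ is not a metric on the color space and would miss an edge between two regions of equal intensity but different hue, whereas $d$ responds to it.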

Altes 1975

"Spline-like Image Analysis with a Complexity Constraint. Similarities to Human Vision"

The author proposes modelling a (1-dimensional) picture as a finite sum of basis functions
which are integrals of delta functions:

$$f(x) = \sum_{n=0}^{N} \sum_{m=0}^{M} f_{nm}\, \delta^{(-n)}(x - x_m),$$

where $\delta^{(-n)}$ is the nth integral of the unit Dirac delta function, $0 \le n < \infty$, and the $x_m$
are free knots. Splines can be viewed as such sums with $1 \le n < \infty$ and smoothness
conditions imposed at the knots, hence the paper's title. Including the point spread
function, $u$, of the imaging system yields

$$f(x) = \sum_{n=0}^{N} \sum_{m=0}^{M} f_{nm}\, u^{(-n)}(x - x_m),$$
Nevatia 1977

"A Color Edge Detector and Its Use in Scene Segmentation"

Nevatia's goal in this work is to define a Hueckel operator for the 3-color domain.

A review of color space representations is presented.

He states there are 3 ways to look for color edges:

1) Choose a metric in the color space and look for discontinuities

2) Choose a basis and look for edges in the projection to each basis element separately

3) Do 2) but require uniformity to use 3 components together.

He chooses to do 3).

However, what he actually proposes doing is minimizing the sum of the squares of the
errors of the individual color component Hueckel fits. This is exactly equivalent to
choosing an inner product on the color space such that the 3 color components are all
orthogonal, then using the metric induced by the inner product, i.e. the Euclidean metric.
This, as he points out, is equivalent to minimizing the individual components separately.
Doing so, though, would lead to 3 fits for the 3 components which might have nothing
whatever to do with each other (since one is not looking for the single edge that leads
to all the data, but independent edges for 3 sets of data). Therefore, he imposes the
additional constraint that the inclination angles for all 3 solutions must be the same, i.e.
he adds the 2 equations $\alpha_1 = \alpha_2 = \alpha_3$. However, computing this angle is not easy,
so instead he takes a weighted average of the 3 independent solutions (i.e., without the
single angle constraint).

The  idea  of  “best”  fit  implies  a  metric,  since  one  must  have  a  way  to  measure  how  good 
the  fit  is.  Hence  there  is  no  way  to  avoid  (explicitly  or  implicitly)  choosing  a  metric  for 
Evaluation

These papers contain some mathematical inaccuracies, which in themselves are not very
important, but whose presence brings into question other mathematical claims which are
not proven. An example of an inaccuracy is the statement that "The set of all continuous
functions over [the closed unit disk] is a Hilbert space." Since a Hilbert space is defined
to be a complete normed inner product space, the statement is false because the space in
question is complete in the sup norm, where there is no inner product, but not complete
in the inner product space $L^2$, which is the one Hueckel is using. One might then be more
skeptical of the claim that the basis functions he settles on are the unique solutions of
some unspecified set of "functional equations."

The main contribution here is to approach the best edge fit problem in a tractable
subspace, thereby transforming an essentially combinatorial problem into an analytic one.
The particular implementation of that idea, however, suffers numerous shortcomings.

Several criticisms have appeared in the literature. [Abdou 1978] argues that the
truncation of the orthogonal series introduces excessive error, especially for thin lines, and that
unjustified assumptions are made in the optimization procedure. [Shaw 1977, Shaw 1979]
makes a similar criticism of the optimization. [Davis 1973] complains that no attempt is
made to relate performance to the image noise process.

Experience using the operator shows that regions of smooth shading result in multiple
firings, while regions busier than the size of the operator have missed edges and poor
parameter values. These failures are a consequence of using a poor model for the
underlying image intensity function. The edge and edge-line models are unrealistic,
especially for the support area of the operator. The difficulty can be traced to the fact that
in the spaces considered, ideal edges and linear functions are not mutually orthogonal.

Unfortunately, no analysis exists, either here or elsewhere, of the error one incurs by
using such simplistic models.
to minimise $d(\pi_k f, \pi_k E_{\theta,p})$ with respect to $\theta, p$. Since the basis is orthonormal, this can
be done componentwise. This is computationally efficient because the series is truncated
at a point which allows a closed form solution for the least squares problem. The line
paper uses essentially the same method, with additional parameters to allow for an ideal
step edge-line, i.e. a sum of 2 parallel ideal step edges. The method can equivalently be
thought of as fitting the best function from the fixed subspace $S_k$ to the data and finding
the best edge fit to the function. (This is a consequence of orthogonalities of various
subspaces).
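
The idea of comparing data and ideal edge only through their projections onto a truncated orthonormal basis can be sketched as follows. This is not Hueckel's operator: the sketch assumes NumPy, a basis obtained by orthonormalizing monomials up to degree 3 on a sampled disc (Hueckel's own basis and weighting differ), a 0/1 edge with no brightness parameters, and an exhaustive grid search in place of his closed form solution. All function names are illustrative.

```python
import numpy as np

def disc_grid(n=21):
    """Sample points of an n x n grid that fall inside the unit disc."""
    c = np.linspace(-1.0, 1.0, n)
    x, y = np.meshgrid(c, c)
    inside = x**2 + y**2 <= 1.0
    return x[inside], y[inside]

def truncated_basis(x, y, degree=3):
    """Discretely orthonormal basis for polynomials up to `degree` (columns)."""
    cols = [x**i * y**j for i in range(degree + 1) for j in range(degree + 1 - i)]
    q, _ = np.linalg.qr(np.column_stack(cols))
    return q

def ideal_edge(x, y, theta, offset):
    """0/1 step edge with normal direction theta at signed distance offset."""
    return (x * np.cos(theta) + y * np.sin(theta) > offset).astype(float)

def best_edge(f, x, y, basis, thetas, offsets):
    """Grid search minimising the distance between projected data and projected edge."""
    pf = basis.T @ f                       # coordinates of the projected data
    best = None
    for th in thetas:
        for off in offsets:
            pe = basis.T @ ideal_edge(x, y, th, off)
            dist = float(np.sum((pf - pe) ** 2))
            if best is None or dist < best[0]:
                best = (dist, th, off)
    return best[1], best[2]

x, y = disc_grid()
basis = truncated_basis(x, y)
thetas = np.deg2rad(np.arange(0, 360, 15))
offsets = np.linspace(-0.4, 0.4, 9)
data = ideal_edge(x, y, thetas[2], offsets[6])     # synthetic noise-free edge
theta_hat, offset_hat = best_edge(data, x, y, basis, thetas, offsets)
print(np.rad2deg(theta_hat), offset_hat)
```

Because only the projected coordinates are compared, any picture function with the same projection as an ideal edge is indistinguishable from that edge, which is the source of the criticism made in the evaluation of this operator.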

The orthonormal expansion used consists of polynomials in $x, y$ with a uniform radial
weighting function $\sqrt{1 - x^2 - y^2}$. For the edge (old) operator, 8 polynomials up to degree
3 are used, while the edge-line (new) uses 9 polynomials up to degree 4 (neither set spans
the space of all polynomials up to their maximum degree). What, if any, classical set of
orthogonal polynomials these correspond to is not stated and not immediately evident,
since the definition of the basis functions is presented in a complex way. The orthogonal
functions are related to a Fourier-Bessel basis, since $x = r \cos\theta$, $y = r \sin\theta$, and the $r$
polynomials can be thought of as approximations to the Bessel functions one obtains for a
radial Fourier transform. It is not stated how the basis functions were derived, however.

The edge/no-edge decision is based on the "angle" between the projections of the data
and the best fit edge in the truncated space $S_k$. I.e., he thresholds on the value of

$$\langle \pi_k f, \pi_k E_{\theta,p} \rangle.$$

This suffers from the common problem that little analysis is devoted to the possible
picture functions $\pi_k^{-1}(\pi_k E_{\theta,p})$, which are going to look like edges to this operator. In
particular, the average gradient plays a large role, and the decision criterion therefore
tends to respond to areas with large average gradients over the support.
Evaluation of line finding

This  was  an  early  effort.  It  probably  is  not  bad  for  straight  lines,  though  it  seems  to  miss 
a  lot.  Curved  edges  or  complex  scenes  are  not  handled,  and  many  ad  hoc  methods  are 
used. 


The  technique  presented  here  has  no  hope  of  working  where  there  are  wide  variations 
in  smooth  shading  gradients,  since  the  thresholds  are  global,  and  the  gradient  operator 
cannot  discern  whether  the  signal  is  from  a  smooth  gradient  or  a  local  step. 

Of  course,  it  must  be  stressed  that  Roberts  broke  ground  in  the  use  of  his  gradient 
operator,  as  well  as  in  the  use  of  homogeneous  coordinates,  the  fitting  of  2-dimensional 
data  to  3-dimensional  models,  and  in  line  following. 

Hueckel  1969,  Hueckel  1971,  Hueckel  1973 

"A Local Visual Operator Which Recognizes Edges and Lines"

[Abdou 1978] presents a detailed analysis, to which we direct the reader rather than
repeat the same points.

The method involves finding the parameters of the best fitting ideal step edge in a disc-like
region of 32 to 137 pixels. The fitting is done in the spirit of the Rayleigh-Ritz method
of finding approximate solutions to variational problems (see, e.g. [Morse and Feshbach
1953]). Using a fixed orthonormal basis for the function space of interest, and a fixed
truncation of the orthonormal basis, he finds parameters to minimise the $L^2$ distance
between the projections of data and ideal edge in the finite dimensional space spanned
by the truncated series. I.e., let $\{\psi_i\}$, $i = 1, \ldots, \infty$ be an orthonormal basis for $L^2$.
Let $f : \mathbb{R}^2 \to \mathbb{R}$ be the picture (data). Let $E_{\theta,p}$ be an ideal step edge of orientation
$\theta$ centered at $p \in \mathbb{R}^2$. Consider the space $S_k$ spanned by the first $k$ basis vectors,
$\psi_1, \ldots, \psi_k$, and let $\pi_k$ be the orthogonal projection onto that space. Then the idea is
• correlate (i.e. sum) along lines of length = 5, for values of $\theta = n \cdot 45°$.

• threshold on the ratio of the line values, yielding edges.

Linking 

•  connect  edgels  if: 

1)  they  lie  in  contiguous  4x4  squares. 

2)  they  are  related  by  a  <  23°  change  in  direction. 

•  eliminate  singletons. 

• apply an ad hoc cleaning process for small triangles, quadrilaterals, and spurs.

Curve representation and segmentation

• least squares fit straight lines to linked sets.

• uses sequential (updating) method of fit.

• first done on connected edgels.

• choose a random starting point, then proceed until:

1) a branch is reached, or

2) an error threshold is exceeded for the line fit, in which case back up until the
local angle to the fit is cut by 1/2.


The remainder of the paper is concerned with the recognition and display of polygonal
3-dimensional  objects. 
Local edge detector

He first takes the square roots of the pixel values, on the basis of psychophysical evidence
which he cites. In a 2 × 2 pixel window, let the square root values be

a b
c d


The edge measure is then defined by

$$\varphi = \sqrt{(a - d)^2 + (b - c)^2}.$$

This is proportional to the gradient magnitude of a least squares fit plane (e.g. [Haralick
1980]). I.e., if $F$ is the best fit plane,

$$|\nabla F| = \frac{1}{\sqrt{2}} \sqrt{(a - d)^2 + (b - c)^2}.$$
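
The proportionality between the Roberts cross and the gradient of the least squares plane can be checked numerically. The sketch below assumes NumPy and unit pixel spacing, with the four samples placed at $(\pm\tfrac12, \pm\tfrac12)$; the function names are illustrative and the code is not Roberts' program.

```python
import numpy as np

def roberts_cross(image):
    """Roberts cross magnitude on square-rooted pixel values.
    The output is one sample smaller than the input along each axis."""
    r = np.sqrt(np.asarray(image, dtype=float))
    a, b = r[:-1, :-1], r[:-1, 1:]
    c, d = r[1:, :-1], r[1:, 1:]
    return np.sqrt((a - d) ** 2 + (b - c) ** 2)

def plane_gradient_magnitude(window):
    """|grad F| of the least squares plane through a 2 x 2 window [[a, b], [c, d]]."""
    (a, b), (c, d) = window
    xs = np.array([-0.5, 0.5, -0.5, 0.5])
    ys = np.array([0.5, 0.5, -0.5, -0.5])
    A = np.column_stack([np.ones(4), xs, ys])
    coef = np.linalg.lstsq(A, np.array([a, b, c, d], dtype=float), rcond=None)[0]
    return float(np.hypot(coef[1], coef[2]))

window = np.array([[1.0, 4.0], [2.0, 7.0]])        # square-root values a, b, c, d
phi = roberts_cross(window ** 2)[0, 0]             # squaring undoes the internal sqrt
print(phi, plane_gradient_magnitude(window) * np.sqrt(2.0))
```

The two printed values agree, which reflects the factor $1/\sqrt{2}$ in the relation above.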

Roberts cautions that his line finder "makes mistakes in complex pictures and is a complex
special-purpose program demonstrating very few general concepts." One must keep in
mind  that  this  was  a  pioneering  work  and  his  main  interest  was  higher  level  model 
matching. 

We  summarize  the  operations  performed  in  line  finding  in  the  following  lists. 

Edge  detection  process 

• $\Phi = R(\sqrt{P})$ (do "Roberts cross" operation, i.e. compute $|\mathrm{grad}|$).

• take max on each 4 × 4 square of a tessellation.


•  threshold. 
Local Methods

Best fit techniques

Roberts 1963

"Machine Perception of Three-Dimensional Solids"

This  is  a  seminal  work,  often  cited  as  the  first  serious  attempt  at  a  functioning  computer 
vision  system. 

The research described seeks to match pictures of a narrow class of prismatic solids to
stored models, starting from raw picture data. There is a wide range of issues which the
author had to address to achieve this; since we are concerned here with segmentation, we
ignore most of the other contributions of the paper.

The central task the program performs is to match a wire frame model to derived wire
frame data. An important part of this consists of vertex matching. To this end, he tries
to fit n-point data (2-dimensional) to an n-point model (3-dimensional) by finding the
best transforms $H$, $D$ in homogeneous coordinates such that

$$AH = DB + \varepsilon,$$

where

$A$ = n points $(x, y, z, w)$ from the model

$B$ = n points $(y, z, w)$ from the data (uses $x$ as projection axis)
$H$ = 3 × 4 homogeneous perspective transform
$D$ = diagonal n × n scale matrix
$\varepsilon$ = error matrix


He solves this as a least squares problem.
feature. In the shape-from-shading paradigm, for example, one is hard put to identify
any stage as "picture domain cues."

In summary, the paradigm presented is a useful one for discussing extant image
understanding systems, and is particularly clear for those based on rudimentary image
characteristics. One must be careful, though, not to be misled into a dogmatic adherence to
the paradigm presented, since it seems likely, perhaps necessary, that it is inadequate
as a description of the type of system required to do successful image understanding
in unrestricted environments. The survey is readily accessible as well as concise; it is
recommended as a good entry into a fair portion of the segmentation literature.
He discusses the $(\psi, s)$ representation of plane curves, defined by

$\psi$ = tangent direction, and
$s$ = arc length.


Then

$d\psi/ds = \kappa$ is the curvature, and

$d\psi/ds = \text{constant} \iff$ the curve of $\psi$ vs. $s$ is a straight line
$\iff \psi$ is a linear function of $s$.

Curves are then found by fitting straight line segments to the $(\psi, s)$ data.
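
As an illustration of this representation, the sketch below computes tangent direction against arc length for a sampled circular arc and fits a straight line to the result; the slope of that line estimates the curvature $1/R$. It assumes NumPy and a curve given as an ordered list of points; the function name `psi_s` and the test arc are illustrative and not taken from Turner's work.

```python
import numpy as np

def psi_s(points):
    """Tangent direction psi and arc length s for a polyline of shape (N, 2).
    Both are evaluated per segment, with s taken at the segment midpoint."""
    seg = np.diff(points, axis=0)
    psi = np.unwrap(np.arctan2(seg[:, 1], seg[:, 0]))
    length = np.hypot(seg[:, 0], seg[:, 1])
    s = np.cumsum(length) - 0.5 * length
    return psi, s

R = 2.0
t = np.linspace(0.0, np.pi, 200)
arc = np.column_stack([R * np.cos(t), R * np.sin(t)])   # half circle of radius R

psi, s = psi_s(arc)
kappa, intercept = np.polyfit(s, psi, 1)   # straight-line fit in the (psi, s) plane
print(kappa)                                # close to 1 / R
```

A straight line in the image gives a horizontal segment in the $(\psi, s)$ plane, a circular arc gives a sloped segment, and a corner appears as a jump in $\psi$, so breakpoints of a piecewise linear fit to the $(\psi, s)$ data mark candidate segment boundaries.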

Abdou 1978

"Quantitative Methods of Edge Detection"

This work is concerned solely with local operators.

The author presents a review of several such operators:

Roberts 1963

Sobel [Duda and Hart 1973]

Prewitt 1970

Compass gradient [Prewitt 1970]

Kirsch 1971

3-level, 5-level [Robinson 1977]
Hueckel 1971, Hueckel 1973

It is interesting to note, perhaps as a comment on the literature in general, that Abdou
presents 8 different 3 × 3 convolution operators. With a support of 9 pixels, there can
be only 9 linearly independent 3 × 3 operators (since they make a 9 dimensional vector
space). The 8 presented are in fact linearly independent, and the further inclusion of an
operator which picks out a single pixel value (the trivial operator), e.g.

0 0 0
0 1 0
0 0 0

would result in a set which spans the entire space.

He evaluates the performance of convolution operators on perfect step edge visual input,
assuming square pixels and area-proportional sampling (i.e. the pixel value $g(p)$ is defined
by

$$g(p) = \int_{R(p)} f \, dA,$$

where $R(p)$ is the (square) pixel support region). This leads to complicated formulae for
pixel values from rotated edges.

He discusses statistical aspects of edge detection and evaluates the 2 × 2 and 3 × 3
operators with respect to statistical performance. E.g., probabilities of detection vs. false
detection for various S/N are compiled.

A discussion of edge detection as pattern classification is also presented, including the
application of the Ho-Kashyap algorithm to the problem.

A review of statistical methods is presented, focussing on various methods of hypothesis
testing:

Bayes decision rule
Neyman-Pearson criterion
minimax criterion

All evaluations are based on assuming the input to consist of a perfect step edge plus
simple (usually Gaussian) noise. Unfortunately, real data rarely have perfect step edges
and usually have non-constant areas which are not edges. (See e.g. the review of [Canny
1983] for a more detailed discussion of this assumption.)
An analysis is presented of the effects of Gaussian noise for linear masks.

Abdou uses Pratt's figure of merit to test the various convolution operators. The best
performers are the 3-level and Prewitt operators (which are essentially the same). (Pratt's
figure of merit is defined as follows. The input is a perfect vertical ramp edge, i.e. a
function only of $x$, having a cross-section of constant-ramp-constant, i.e. a constant part
connected by a linear part to another constant part. The variable parameters of the input
are the contrast (the difference between the 2 constant values), the slope (of the linear
transition ramp), and the standard deviation of additive Gaussian noise. The figure of
merit is then defined by a formula based on parameters of the output error. Also, an
analogous version is presented for edges at a 45° angle to vertical.)
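
For concreteness, the form in which Pratt's figure of merit is commonly quoted is $F = \frac{1}{\max(N_I, N_A)} \sum_{i=1}^{N_A} \frac{1}{1 + \alpha d_i^2}$, where $N_I$ and $N_A$ are the numbers of ideal and detected edge points, $d_i$ is the distance of the $i$-th detected point from the ideal edge, and $\alpha$ is a scaling constant. The sketch below assumes that form, a vertical ideal edge at a known column, and $\alpha = 1/9$; the exact variant and constants Abdou used are not given in this review, so these choices and the function name are assumptions.

```python
import numpy as np

def pratt_fom(detected, ideal_x, n_ideal, alpha=1.0 / 9.0):
    """Figure of merit for a boolean edge map against the vertical line x = ideal_x.
    `n_ideal` is the number of edge points an ideal detector would mark."""
    ys, xs = np.nonzero(detected)
    dist2 = (xs - ideal_x) ** 2
    total = np.sum(1.0 / (1.0 + alpha * dist2))
    return float(total / max(n_ideal, xs.size))

perfect = np.zeros((8, 8), dtype=bool)
perfect[:, 4] = True            # every row marked exactly on the ideal edge
shifted = np.zeros((8, 8), dtype=bool)
shifted[:, 5] = True            # every row marked one pixel off

print(pratt_fom(perfect, ideal_x=4, n_ideal=8))
print(pratt_fom(shifted, ideal_x=4, n_ideal=8))
```

The normalization by the larger of the two counts penalizes both missed and spurious edge points, while the distance term penalizes displaced ones.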

For convolutions with square support, he analyzes the effects of mask size, center-weighted
masks, and local adaptive thresholds.

Abdou proposes 2 new edge operators: 1- and 2-dimensional ramp best fits, resp. The
idea for the 1-dimensional case is to fit an ideal 1-dimensional ramp edge to the data for
all possible ramp sizes (with discrete end points). Results for each size are given in closed
form, but the various sizes must be considered separately to determine the best among
them. The 2-dimensional ramp best fit proceeds in the same way as the 1-dimensional,
but he also considers all possible orientations. These he limits to multiples of 45°.
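
The structure of the 1-dimensional ramp fit, a linear least squares solve for each candidate pair of ramp end points followed by an exhaustive comparison, can be sketched as follows. The sketch assumes NumPy and a ramp model with two free parameters (base level and contrast); it uses a generic least squares solver where Abdou gives closed form results, and the function names are illustrative.

```python
import numpy as np

def ramp_profile(n, start, end):
    """Unit ramp on n samples: 0 up to `start`, 1 from `end` on, linear in between."""
    x = np.arange(n, dtype=float)
    return np.clip((x - start) / (end - start), 0.0, 1.0)

def best_ramp(data):
    """Try every discrete (start, end) pair; return (error, start, end, level, contrast)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    best = None
    for start in range(n - 1):
        for end in range(start + 1, n):
            A = np.column_stack([np.ones(n), ramp_profile(n, start, end)])
            coef = np.linalg.lstsq(A, data, rcond=None)[0]
            err = float(np.sum((A @ coef - data) ** 2))
            if best is None or err < best[0]:
                best = (err, start, end, float(coef[0]), float(coef[1]))
    return best

data = 10.0 + 6.0 * ramp_profile(12, 3, 6)     # synthetic noise-free ramp edge
err, start, end, level, contrast = best_ramp(data)
print(start, end, level, contrast)
```

The number of candidate fits grows quadratically with the window length, and adding orientations multiplies it further, which is the exhaustive comparison referred to in the evaluation of this method.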

There are several appendices:

Analysis of the Hueckel operator (fairly good)

Orthogonal transformation in edge detection

(the beginnings of a DFT method of edge detection)

The Herskovits algorithm (not a very enlightening discussion)

Derivations of Eqs. 3.29, 3.31, 3.32 (some statistics)

Experimental results (pictures): not very informative, extensive or useful.
He only provides binary edge maps of 3 pictures. One can't really
see what is happening locally (the pixels are too small to be seen).

Evaluation 

To the extent that the local edge ramp hypothesis is valid, the ramp fitting method
may work, though it is essentially equivalent to applying various slope masks in various
directions. This is a rather inelegant approach since a best fit must be performed for each
possible ramp width and angular orientation, with the optimum found by exhaustively
comparing the error parameters for all the fits. One advantage over gradient operators
and other best fit operators is that the present method can be used to reject regions of
smooth shading if the all-ramp condition is rejected as not an edge. The main virtues,
then, stem from the enlargement of the space of possible features to include the ramp
edges. However, the Haralick "facet" model is more general, no more expensive, more
elegant, and probably more effective, though probably also inadequate (see review of
[Haralick 1980]).

Beaudet  1978 

* Rotationally  Invariant  Image  Operators " 

The  author  is  interested  in  finding  a  least  square  polynomial  approximation  to  image 
data.  The  coefficients  of  the  monomial  tcrniB  arc  computed  via  convolution. 

The starting point is to consider the polynomial to be fitted as a truncated Taylor series.
The coefficients are found as in a normal least squares problem, but are taken to represent
the derivatives in the Taylor expansion. To 1st order, this is the same as fitting a plane
and  estimating  the  gradient.  The  quadratic  part  is  tantamount  to  finding  the  classical 
Hessian. 
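
The fitting step described above can be sketched in a few lines. This is only an illustration under our own assumptions, not Beaudet's masks: it uses an odd, square window centered on the pixel, a quadratic (2nd order) fit only, and hypothetical function names (`quadratic_fit_masks`, `local_derivatives`). Each row of the pseudo-inverse of the design matrix is a mask whose response is one Taylor coefficient of the least squares fit.

```python
import numpy as np

def quadratic_fit_masks(size=5):
    """Masks whose responses are the coefficients of a least squares quadratic
    fit, read as Taylor coefficients at the center of a size x size window."""
    r = size // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    x = xs.ravel().astype(float)
    y = ys.ravel().astype(float)
    # Columns are the Taylor monomials, so the fitted coefficients estimate
    # f, f_x, f_y, f_xx, f_xy, f_yy at the center pixel.
    design = np.column_stack([np.ones_like(x), x, y, 0.5 * x**2, x * y, 0.5 * y**2])
    pinv = np.linalg.pinv(design)  # shape (6, size * size)
    names = ["f", "fx", "fy", "fxx", "fxy", "fyy"]
    return {name: pinv[i].reshape(size, size) for i, name in enumerate(names)}

def local_derivatives(window, masks):
    """Apply every mask to one window (sum of elementwise products)."""
    return {name: float(np.sum(mask * window)) for name, mask in masks.items()}

# A window sampled from an exact quadratic is recovered exactly.
masks = quadratic_fit_masks(5)
ys, xs = np.mgrid[-2:3, -2:3]
window = 2.0 + 3.0 * xs - 1.0 * ys + 0.5 * xs**2 + 2.0 * xs * ys - 1.5 * ys**2
derivs = local_derivatives(window, masks)
```

Sliding each mask over the image is then an ordinary convolution (up to the usual flip of the mask), which is the sense in which the coefficients are "computed via convolution."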

Beaudet considers operators up to 4th order, and operator sizes from 3 X 3 to 8 X 8. The
only rotationally invariant 1st order operator is the gradient, or rather, more precisely,
the squared magnitude of the gradient, $\nabla f \cdot \nabla f$. The 2nd order operators correspond to
the linear invariants of the Hessian matrix,

$$H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{pmatrix}$$

as well as the scalar-valued operators $|H \nabla f|$ and $\nabla f \cdot H \nabla f$.


Unfortunately, it appears that the author confuses the Hessian with a matrix
representation sometimes called the Weingarten map, which is the differential of the Gauss
map. The linear invariants of the Weingarten map are the intrinsic curvatures of the
surface: the eigenvalues are the principal curvatures, the trace is the mean curvature, and the
determinant is the Gaussian curvature. The author, however, attributes these properties
to the Hessian. This confusion most likely stems from the fact that the two coincide
at any critical point of the function $f$, and it is possible to rotate the 3-dimensional
coordinate system of a surface in $R^3$ so that any given point is a critical point when the
surface is being viewed as the graph of a function from $R^2 \to R$. This is commonly
done in expositions of the subject to simplify formulas. However, since we are in a fixed
coordinate system, such a simplification is not possible (without, of course, including the
rotation matrices). (See, e.g., [do Carmo 1976].) The differential of the Gauss map, when
the surface is given as the graph of a function $f : R^2 \to R$, can be written in coordinates


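$$dN = \frac{1}{(1 + f_x^2 + f_y^2)^{3/2}} \begin{pmatrix} 1 + f_y^2 & -f_x f_y \\ -f_x f_y & 1 + f_x^2 \end{pmatrix} \begin{pmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{pmatrix}$$

which is easily seen to reduce to the Hessian at a critical point of $f$.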
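
The difference between the two matrices is easy to check numerically. The sketch below is ours, with hypothetical function names, and fixes the sign of the normal so that the expression agrees with the Hessian at a critical point. It evaluates both matrices on a spherical cap of radius 2, where both principal curvatures should equal 1/2 everywhere.

```python
import numpy as np

def hessian(fxx, fxy, fyy):
    return np.array([[fxx, fxy], [fxy, fyy]], dtype=float)

def weingarten_map(fx, fy, fxx, fxy, fyy):
    """Differential of the Gauss map for the graph z = f(x, y), in the
    coordinate form used in the text (one choice of normal orientation)."""
    metric_adjugate = np.array([[1 + fy**2, -fx * fy],
                                [-fx * fy, 1 + fx**2]], dtype=float)
    return metric_adjugate @ hessian(fxx, fxy, fyy) / (1 + fx**2 + fy**2) ** 1.5

def sphere_cap_derivatives(x, y, radius):
    """Partials of f(x, y) = -sqrt(radius^2 - x^2 - y^2)."""
    w = np.sqrt(radius**2 - x**2 - y**2)
    return (x / w, y / w, 1 / w + x**2 / w**3, x * y / w**3, 1 / w + y**2 / w**3)

d = sphere_cap_derivatives(0.3, 0.4, 2.0)   # a non-critical point of f
W = weingarten_map(*d)                      # both principal curvatures are 1/2
H = hessian(*d[2:])                         # its eigenvalues are not 1/2
W_critical = weingarten_map(0.0, 0.0, 1.0, 0.2, -3.0)
```

Away from the critical point the Hessian's eigenvalues depend on position even though the surface is a sphere, which is the discrepancy at issue; at a critical point the two matrices agree.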


Beaudet correctly points out that the trace of the Hessian is the Laplacian, but he makes
incorrect assertions about the relations between the quantities he derives from the Hessian
and various curvatures.


Three 3rd order operators are presented, which are claimed to have significance as line
end, curve boundary, and line detectors.

The above terminology and interpretation is mine; he presents these in the more classical
language of tensors and coordinates, where his operators are contractions of tensors.

One  should  note  that  considerations  and  techniques  very  similar  to  those  presented  in 
this  paper  were  described  by  Prewitt  circa  10  years  before,  though  no  reference  is  made 
to  that  work. 

Evaluation 

The  experimental  results  consist  in  the  application  of  a  few  of  the  operators  to  a  single 
image.  Since  the  notions  of  line  detection  and  edge  detection  are  very  simplistic,  there 
is  no  effort  to  use  the  results  of  the  processing  in  any  way  other  than  to  present  the 
magnitude  of  the  operator  output.  Not  surprisingly,  this  is  not  very  effective.  However, 
more  sophisticated  processing  based  on  the  obtained  fit  is  promising.  A  potential  difficulty 
may  lie  in  the  manner  in  which  the  fit  is  obtained,  since  polynomial  least  squares  fits  tend 
to  produce  spurious  oscillations. 

Despite  these  shortcomings,  the  proposal  to  compute  geometrically  and  analytically 
significant  properties  of  the  image  intensity  function,  using  convolutions,  is  a  worthwhile 
contribution. The thrust, perhaps not made clear by the author, is to derive an
understanding of the image intensity function in terms which have precise, well-understood
meanings, and which go beyond a few naively chosen parameters. As it happens, the
error about intrinsic surface properties may be fortuitous, since it may make more sense
to consider the Hessian of the intensity function, rather than its surface geometry
independent of coordinate system. There is, after all, a special coordinate system in this
situation: intensity (the z-axis) is quite different from location (x, y), and so there is no
reason to expect that the invariances of rotating the entire 3-dimensional space should
be the right ones. It would be interesting to see results of psychophysical studies where
the intensity function is changed so that only the Hessian or the Weingarten map, but
not both, changes.

As presented, this is not a viable edge detection method. However, the idea of local fitting
merits further investigation, particularly in regard to deriving differential operators.

Hsu,  Mundy,  Beaudet  1978 
"Web Representation of Image Data"

The authors are interested in using a local quadratic fit to detect image features. A
quadratic polynomial is least squares fit to the image data on (presumably uniform) local
neighborhoods. The polynomial is regarded as a Taylor series, and the coefficients are
interpreted as partial derivatives (see [Beaudet 1978]). The principal axes are identified,
and a mesh is constructed over the image by extending straight lines along these special
directions until some error threshold is reached, resulting in a new mesh node and
repetition of the process. The implementation is based on starting from seed nodes, with
special rules for the image periphery, propagating down and right, and merging of nearby
nodes. Some nodes of the resulting mesh are labelled according to the "curvatures" and
an extremum predicate. Global paths through the mesh are then sought by the use of
production rules based on the local labelling to follow arcs. It is not entirely clear how
this process works; apparently some kind of relaxation is involved.

Experimental  results 


Partial results are shown for 1 real and 2 synthetic images of ca. 128 X 128 resolution.
Feature finding is only shown for two of these, where a purported ridge is found in a
synthetic normal saddle, and some ridges are found in a real picture of scratches. The
performance on the real picture is quite poor, although it is hard to isolate the reason.
Probably it is a consequence either of the extreme coarseness and irregularity of the mesh,
or the localness and ingenuousness of the production rules.

Evaluation

See the review of [Beaudet 1978] for remarks about fitting and differential operators.
The same misconception is present here as in [Beaudet 1978] regarding use of the Hessian
to define principal curvatures and intrinsic surface properties, rather than the correct
expression for the differential of the Gauss map. Consequently, the "principal axes" and
"curvatures" the authors find correspond to the conventional usage of those terms only at
stationary points of the image intensity function. However (see review of [Beaudet 1978]),
these objects may actually be more meaningful for image analysis than the geometric
invariants. E.g., [Canny 1983] uses essentially the same parameters in his directional
operators.

The  construction  of  the  mesh  is  a  good  idea  insofar  as  a  coordinate  system  based  on 
principal  directions  is  found.  However,  the  mesh  is  far  too  coarse,  and  the  method  of  its 
construction  leads  to  a  topology  which  may  not  have  much  to  do  with  the  underlying 
structure. The authors apparently wanted a graph structure to propagate their production
rules on, but unless they have bugs, what they got was more or less a mess. The
production rule technique is not very well explained, hence difficult to evaluate, but the
impression one gets is that it is somewhat inflexible, e.g. putting limits on rotation of
principal direction. It is not clear, e.g., how the production system performs a function
separate  from  the  mesh  generation  itself,  where  error  criteria  are  also  imposed.  It  may 
be  that  using  a  finer  mesh  would  provide  much  improved  results. 

A  second  problem  is  that  no  analysis  is  given  regarding  noise  behavior.  A  big  question  is 
the  behavior  of  the  mesh  generation  in  the  presence  of  noise. 

Dreschler and Nagel 1981a, Dreschler and Nagel 1981b

"Volumetric Model and 3D-Trajectory of a Moving Car Derived from Monocular TV-Frame
Sequences of a Street Scene"


A  Survey  of  Edge  Detection 


Local  Methods  48 


The authors are primarily interested in tracking objects in a sequence of successive static
frames. They seek point features which are expected to be stable from frame to frame,
settling on extremal points of the Gaussian curvature of the intensity function. The
computation of the curvature is performed via "principal curvatures" using the operators
of Beaudet (which in fact compute something other than principal curvatures: see review
of [Beaudet 1978]).

The authors are motivated by seeking local extrema of Gaussian curvature. However,
they found that such extrema occur at knees of edges (cliffs in the intensity function)
in an unstable manner, as a consequence of local noise and small variations. Therefore,
a more involved predicate is used. Viz., pairs of nearby points are found which are a
maximum and a minimum of Gaussian curvature. Along the line joining these points,
that point having the steepest slope of intensity (i.e. directional derivative) is selected as
the feature point, subject to the following 2 criteria. First, it is asserted that exactly 1
principal curvature must change sign along the line in question (this is true only if the
extrema of Gaussian curvature are of opposite sign, which is implicitly assumed), hence it
is required that the principal direction corresponding to the principal curvature which is
changing sign be roughly parallel to the line in question. This assumes that the extrema of
the Gaussian curvature should be joined by principal curves, a proposition whose truth is
by no means self-evident. Secondly, the intensity value at the maximum must be greater
than that at the minimum. This is for the case that the high intensity area is convex
at the corner. Since the reverse case obtains by turning the surface upside down, which
does not change the Gaussian curvature anywhere, the opposite condition must be true
when the low intensity area is convex, so without other information about the context of
the extrema, this seems to be a vacuous condition. Also, an ad hoc maximum separation
of 4 pixels is required for pairs of extrema to be linked. Obviously, this is a requirement
that the corner be quite sharp at the resolution of the image.

Both 5 X 5 and 3 X 3 operators are used: the 5 X 5 for good noise behavior, and the 3 X 3
for better resolution in places selected by the 5 X 5. The operators used are the ones
presented in [Beaudet 1978]. Consequently, the present authors are victims of an incorrect
definition of Gaussian curvature (see review of [Beaudet 1978]) and principal directions.
However, it is extrema of Gaussian curvature which are of interest. The relation between
these and what is actually (erroneously) used is algebraically complicated, and we do not
attempt to analyse it, but these parameters may be just as meaningful for images as the
geometric ones. Furthermore, there is already a heuristic element to locating the points
of interest. Therefore, it doesn't seem likely that the use of the correct values of the
Gaussian curvature would change the performance significantly. To get a better understanding of
the situation, one should in fact analyse the behavior of these parameters in the light of
what is known about the image irradiance equation.

Experimental  results 

The results displayed seem to be fairly good. Of course, there are a number of other
elements of the system we are not considering here, e.g. the method of tracking, so that
it  is  difficult  to  say  how  reliably  the  features  selected  represented  intrinsic  features  of 
objects  or  even  of  the  intensity  function. 

Evaluation 

The  present  work  is  best  regarded  as  a  corner  detector.  As  such,  it  is  not  adequate 
for  performing  segmentation.  As  far  as  its  usefulness  for  matching  images  is  concerned, 
one would have to analyse to what degree extrema of Gaussian curvature are intrinsic
features of the object geometry, rather than the intensity surface geometry. There are 2
components to such a study: the effects of perspective transformation, and the effects of
photometric laws. An initial approach could consider these components separately, i.e.
constant light with moving observer, and fixed observer with moving light source. Since
the features used are piecewise smooth functions of the parameters of motion and lighting,
one can expect that they will trace out piecewise smooth curves as those parameters are
varied, and hence they can be tracked. Whether they are good things to track is another
question. Consider the extreme case of a moving flat mirror, moving in its own plane,
and  reflecting  a  light  source.  This  isn’t  a  completely  ridiculous  case,  since  it  is  a  limiting 
case  of  what  can  happen  with  specularity,  which  in  turn  is  a  matter  of  degree  for  the 
reflectance  function.  The  point  to  note  is  that  the  feature  associated  with  the  specularity 
will  behave  as  a  function  of  the  location  of  the  light  source  rather  than  as  a  function  of 
the  motion  of  the  object  reflecting  it.  The  moral  is  that  the  behavior  of  a  feature  can  be 
highly  decoupled  from  that  of  the  object  whose  surface  creates  it.  A  less  extreme  example 
to ponder is studied by [Koenderink and van Doorn 1980], who show that the extrema of
the  image  intensity  stay  near  parabolic  lines  of  the  object  surface  (but  move  along  them). 

The  relevance  to  image  segmentation  is  this.  Principal  curvatures,  principal  directions, 
and principal curves are useful features of the image intensity function. They define
a  local  geometry,  and  notably  a  local  orthogonal  coordinate  system  which  is  a  natural 
coordinate  system  in  the  vicinity  of  edges.  Predicates  based  on  observation  of  the  behavior 
of  principal  curves  seem  good  candidates  for  edge  detection  and  hence  segmentation.  This 
work  at  least  shows  that  such  features  have  some  stability  in  the  presence  of  noise  and 
deformation. 

Haralick 1980

"Edge and Region Analysis for Digital Image Data"

The  view  taken  here  is  that  edges  and  regions  can  be  viewed  as  places  where  there  are 
large or small differences, resp., in some parameters. In this light, the old method, i.e.
looking for perfect step edges, amounts to fitting a piecewise constant function to the
image intensity. The new method, which the author puts forth, is to do a piecewise linear
fit, i.e. to fit planes (or facets). The work is purely theoretical in that real images are
not  considered. 

The central feature of the analysis is to perform a least squares fit of a plane to the data.
The author provides a nice analysis of noise for this problem. The critical question is
whether 2 planar patches are actually part of the same plane: the edge null hypothesis.

To resolve this question, he uses the F-test on a $\chi^2$ distribution derived from the error.

More specifically, the way this is used is as follows. Each point $p$ of the picture is assigned
a neighborhood $p \mapsto U_p$, which is the one supporting the best fit among all neighborhoods
containing $p$. I.e., of all neighborhoods $U$ such that $p \in U$, let $U_p$ be such that the
error of fit $e(U_p)$ is minimum.

Edge  and  region  detection  are  then  based  on  an  F-test  of  the  parameters  associated  with 
the  optimal  neighborhoods  for  adjacent  pixels,  followed  by  thinning.  Even  neglecting  the 
piecewise planarity assumption, this adjacent-F-test is probably too simple-minded.

The technique can be summarized as follows:

edge detection method:

each pixel has a best-fit neighborhood with parameters of fit.
edgeness = F statistic that adjacent pixels' fits come from same plane.
compute for vertical, horizontal adjacencies for vertical, horizontal edges.
find maxima by non-maximum suppression.

region growing method:

group adjacent pixels if same best fit neighborhood plane hypothesis cannot be rejected.

The hypothesis testing is based on the relation between parameter differences and errors
of fit. If the local fit is relatively poorer, greater parameter differences are tolerated for
region merging. In this sense, the region merging is adaptive. However, no analysis is
presented  describing  how  this  method  would  behave  for  large  regions  or  long  edges.  Also, 
no  attention  is  given  to  the  problem  of  determining  whether  local  edges  are  part  of  a 
larger  edge. 
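
The flavor of the same-plane test can be illustrated with a standard pooled-versus-separate F statistic (a Chow-type test). We stress that this is a substitute chosen for illustration, not necessarily the statistic Haralick derives; the function names, the 3 X 3 windows, and the noise level are all our assumptions.

```python
import numpy as np

def fit_plane(coords, values):
    """Least squares plane fit; returns the parameters and the residual sum of squares."""
    design = np.column_stack([coords[:, 0], coords[:, 1], np.ones(len(coords))])
    params, *_ = np.linalg.lstsq(design, values, rcond=None)
    rss = float(np.sum((design @ params - values) ** 2))
    return params, rss

def same_plane_f(coords_a, values_a, coords_b, values_b):
    """F statistic for the null hypothesis that both patches lie on one plane."""
    k = 3  # parameters per plane
    _, rss_a = fit_plane(coords_a, values_a)
    _, rss_b = fit_plane(coords_b, values_b)
    _, rss_pooled = fit_plane(np.vstack([coords_a, coords_b]),
                              np.concatenate([values_a, values_b]))
    dof = len(values_a) + len(values_b) - 2 * k
    return ((rss_pooled - rss_a - rss_b) / k) / ((rss_a + rss_b) / dof)

rng = np.random.default_rng(0)
rows, cols = np.mgrid[0:3, 0:3]
left = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
right = left + np.array([0.0, 3.0])            # the adjacent 3 x 3 window

def plane(c):
    return 1.0 + 0.5 * c[:, 0] + 0.2 * c[:, 1]

def noise():
    return rng.normal(0.0, 0.1, size=9)

f_same = same_plane_f(left, plane(left) + noise(), right, plane(right) + noise())
f_step = same_plane_f(left, plane(left) + noise(), right, plane(right) + 5.0 + noise())
```

The adaptivity noted above is visible in the denominator: when the separate fits are poor, the same parameter difference yields a smaller statistic, so merging is tolerated more readily.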

The author includes a quick but nice review of some related literature. For example, he
shows that the "Roberts cross" operator [Roberts 1965] computes the magnitude of the
gradient of a linear fit (although this is all but explicitly stated in [Prewitt 1970]).
Unfortunately, the paper includes no experimental results or consideration of real images.

Evaluation 

The  idea  of  fitting  regions  and  looking  at  the  parameters  is  a  good  one.  Statistical  analysis 
is  good,  too.  However,  the  piecewise  planar  hypothesis  is  not  sophisticated  enough.  On 
the other hand, the statistics become more complicated for more complicated fits. In the
form proposed, this method is not likely to be noticeably better than other local methods.
The extended edge and region part is rather ad hoc, not based on a sound analysis. This
paper  can  be  recommended  as  a  good  introduction  to  the  use  of  statistics  and  fitting, 
despite  some  ambiguities. 

Haralick  1981,  Haralick  1982,  Haralick  1984 

"Digital Step Edges from Zero Crossing of Second Directional Derivatives"

The  essential  feature  of  the  technique  proposed  by  the  author  is  fitting  the  image  intensity 
function  by  a  polynomial. 

He first intuitively defines edges as discontinuities in brightness value or its "derivative."
But then he notes that for this to make sense, the discrete picture must be thought of as
samples of a function on a continuum. To obtain such a function from the data, he does
polynomial approximation using discrete orthogonal polynomials. "Discrete orthogonal"
means orthogonal with respect to the "inner product"

$$\langle f, g \rangle = \sum_{p \in P} f(p)\, g(p)$$

where $P$ is some finite set of points, though this is not explicitly stated. It is not a true
inner product because it can happen that $\langle f, f \rangle = 0$ with $f \neq 0$ (see, e.g., [DeBoor]).
Regrettably, he provides no references: there is, after all, a rather large literature
pertaining to fitting polynomials.

Within this context, he can define what is meant by "edge." In [Haralick 1981], this is
defined as a place where the "direction isotropic magnitudes" of the 1st or 2nd partials of
the fitted function exceed some threshold. However, he requires the assumption that the
derivatives of the underlying function are uniformly bounded except at discontinuities
(so that the high estimated values can be attributed to discontinuities). This is not very
realistic, and in [Haralick 1982 and Haralick 1984] it is replaced by a definition of "edge"
as a zero crossing of the 2nd directional derivative in the gradient direction (see review
of [Canny 1983] for a more detailed definition of this entity), i.e. a maximum of the
gradient. While looking at the parameters of a fitted function is a good approach, this
is still too local a criterion, and too simplistic a structural representation, so that most
of the benefits of surface fitting are lost, as demonstrated by [Canny 1983]. In [Haralick,
Watson, Laffey 1983], he improves this considerably, expanding to the derivation of the
qualitative structure of the function.
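
The later edge definition can be written out from the fitted partials. The sketch below uses the standard expression for the 2nd derivative taken along the gradient direction; the function name and the small `eps` guard against a vanishing gradient are our own additions, and the partials are assumed to come from whatever local fit is in use.

```python
import numpy as np

def second_derivative_along_gradient(fx, fy, fxx, fxy, fyy, eps=1e-12):
    """2nd directional derivative of f in the direction of its own gradient."""
    grad_sq = fx**2 + fy**2
    return (fx**2 * fxx + 2 * fx * fy * fxy + fy**2 * fyy) / np.maximum(grad_sq, eps)

# A smooth step f(x, y) = tanh(x), sampled on either side of its steepest point.
x = np.array([-0.5, 0.0, 0.5])
sech2 = 1.0 / np.cosh(x) ** 2
fx, fxx = sech2, -2.0 * np.tanh(x) * sech2
zeros = np.zeros_like(x)
d2 = second_derivative_along_gradient(fx, zeros, fxx, zeros, zeros)
```

The sign of `d2` changes from positive to negative across x = 0, where the gradient magnitude is largest, which is the zero crossing the definition looks for.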

He imposes 1-dimensional symmetry on the index sets of the polynomials, i.e. the points
on which they are defined must be symmetric about the origin. For 2-dimensional basis
functions, he uses the tensor product of his 1-dimensional set. He then shows how to
fit by the usual method of projection onto the orthonormal basis. A further section of
[Haralick 1981] is devoted to showing that $D_x^2 + D_y^2$ and $D_{xx} + D_{yy}$ are rotationally
invariant differential operators.
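
A 1-dimensional version of the construction is sketched below, under our own assumptions: the basis is obtained by QR factorization of a Vandermonde matrix (equivalent to Gram-Schmidt on the monomials under the discrete inner product), the index set is the symmetric set {-2, ..., 2}, and the function names are hypothetical. The last lines exhibit a nonzero polynomial whose discrete "norm" is zero, which is why the form is not a true inner product on polynomials.

```python
import numpy as np

def discrete_orthonormal_basis(points, degree):
    """Columns are polynomials of degree 0..degree, orthonormal with respect
    to <f, g> = sum over the points of f(p) g(p)."""
    vandermonde = np.vander(points, degree + 1, increasing=True)  # 1, x, x^2, ...
    basis, _ = np.linalg.qr(vandermonde)
    return basis

def project(basis, samples):
    """Least squares fit by projection onto the orthonormal basis."""
    coeffs = basis.T @ samples
    return basis @ coeffs

points = np.arange(-2.0, 3.0)                 # symmetric index set
basis = discrete_orthonormal_basis(points, 2)
samples = 1.0 - 2.0 * points + 0.5 * points**2
fitted = project(basis, samples)

# The polynomial prod_i (x - p_i) is not the zero polynomial,
# yet it vanishes at every point of the index set.
vanishing = np.prod([points - p for p in points], axis=0)
```

A 2-dimensional tensor-product basis function is then an outer product such as `np.outer(basis[:, i], basis[:, j])`.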

Evaluation

The idea of fitting a function to the intensity data as a first step in edge finding is good,
although the definition of "edge" is somewhat simple-minded. E.g., the 1st derivative
criterion will result in edges being found in regions of smooth shading. Unfortunately,
the papers do not address issues associated with the fitting problem. E.g., polynomial




done, which Canny tackles with numerical methods, yielding a family of optimal
convolution kernels, parametrized by $K$, the mean separation of maxima normalized by
the support interval. Qualitatively, this family ranges in appearance from a smoothed
difference of boxes for small values of $K$ to a derivative of Gaussian for large values of $K$.

At the same time, he develops another measure of multiple response, a local measure
given by

$$\frac{|f'(0)|}{\|f''\|}$$

To get a signal-independent constraint, a proportionality can be required between this
measure and the false positive (detection) measure, since they are both normally
distributed; i.e. one can require

$$\frac{|f'(0)|}{\|f''\|} = k\, \frac{\langle f, \tilde u_{-1} \rangle}{\|f\|}$$

In terms of previously defined quantities, this can be written as

$$\Lambda K W = k \Sigma$$

Although Canny seeks an $f$ for which $k = 1$, the best he is able to do is $k = .58$, which
is not too surprising, since at this point the constraints are no longer all independent.
This value is achieved for one of the larger values of $K$. The $f$ thus arrived at is well
approximated by a derivative of Gaussian, which is desirable for ease of computation,
particularly in a 2-dimensional version. However, aside from computational
considerations, it is not entirely clear that this is a necessary choice. Canny does not make it
clear that one necessarily wants $k = 1$, and for that matter, the argument leading to the
response measure used in defining $k$ is less convincing than one would like. It would
seem that where a 1st derivative of noise response is used, a 2nd derivative (possibly
averaged over some neighborhood) should be used. In any case, he computes that the

raw (i.e. signal-dependent) SNR and localization terms do not cancel when these terms
are multiplied, but result in a coefficient of $A^2/n_0^2$, i.e. if these terms are not dropped in
defining the detection and localization criteria, the resulting product would be

$$\frac{A^2}{n_0^2}\, \Sigma \Lambda$$


The problem of finding an $f$ to optimize the composite measure is solved as a variational
problem, making an assumption of finite extent and thereby using a tractable formulation.
The set of admissible functions is taken to be $C^0$, which may be slightly inconsistent,
since it would seem that $f$ must be at least $C^1$ to conclude that the maximum of $f * I$ will
be achieved where the derivative is 0, which was used in the derivation of the optimization
measure. Solving the variational problem leads to an expression depending on a parameter
(for normalized $f$). It turns out that the parameter can be increased without bound,
leading to ever better $f$'s, and, in fact, the limit of the $f$'s is a difference of boxes (not in
the admissible space), which, not too surprisingly, is the Wiener filter, giving infinitely
good localization, and the best SNR.

Multiple  response  criterion  and  optimizing  for  all  criteria 

Now, if $f$ is a difference of boxes, $f * I$ is no longer smooth, so the derivative method of
finding maxima is called into question. But what is more important, as Canny notes, the
maxima will be essentially as noisy as the noise. This observation leads to an excellent
way of imposing a smoothness constraint on $f$. Namely, one can couple the requirement
that maxima of $f * I$ be sufficiently isolated (in the mean) with a formula giving the
separation value to arrive at a smoothness constraint on $f$ of the form

$$\text{mean separation} = \frac{\|f'\|}{\|f''\|} = KW$$

where $W$ is the support width, and $K$ the parameter which sets the constraint in units
of $W$. This leads to a complicated algebraic problem once the variational work has

so one can write

$$\frac{(f' * n)(x)}{-A f'(0)} = x + \frac{1}{f'(0)}\,(\text{higher order terms})$$

Taking root mean square expectation values, one gets

$$\frac{n_0 \|f'\|}{A\,|f'(0)|} = \sigma_x + (\text{higher order terms})$$

The right side is taken as an approximation to the standard deviation of $x$, the solved-for
location. The localization measure is taken as the reciprocal of the left side (with the
signal-dependent factor $A/n_0$ dropped), i.e.

$$\Lambda = \frac{|f'(0)|}{\|f'\|}$$

$\Lambda$, then, should also be maximised.
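
Both measures are easy to evaluate numerically for a candidate kernel. The sketch below is ours and rests on several assumptions: the signal-dependent factor $A/n_0$ is dropped, absolute values are taken throughout, integrals are replaced by Riemann sums on a fine grid, and the kernel is a 1st derivative of a Gaussian (the function names are hypothetical). Evaluating the same kernel shape at two widths shows how each measure scales separately.

```python
import numpy as np

def detection_and_localization(f, dx):
    """Return (Sigma, Lambda) for a kernel sampled on a symmetric grid
    with an odd number of points, so that f[len(f) // 2] is f(0)."""
    center = len(f) // 2
    df = np.gradient(f, dx)                               # numerical f'
    norm_f = np.sqrt(np.sum(f**2) * dx)
    norm_df = np.sqrt(np.sum(df**2) * dx)
    sigma = abs(np.sum(f[:center]) * dx) / norm_f         # |integral of f over x < 0| / ||f||
    lam = abs(df[center]) / norm_df                       # |f'(0)| / ||f'||
    return sigma, lam

def derivative_of_gaussian(x, s):
    return -x / s**2 * np.exp(-x**2 / (2 * s**2))

dx = 0.01
x = np.arange(-4000, 4001) * dx                           # grid on [-40, 40], x = 0 at the center
s1, l1 = detection_and_localization(derivative_of_gaussian(x, 1.0), dx)
s3, l3 = detection_and_localization(derivative_of_gaussian(x, 3.0), dx)
```

Widening the kernel by a factor of 3 raises the detection measure by $\sqrt{3}$ and lowers the localization measure by the same factor, so the product is unchanged (about 0.92 for this kernel shape).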

Optimizing  sensitivity  and  localization 
Canny chooses to optimise over the composite measure

$$\Sigma \Lambda = \frac{\langle f, \tilde u_{-1} \rangle\, |f'(0)|}{\|f\| \cdot \|f'\|}$$

based on the observation that this is a scale invariant quantity, i.e. its value is the same
for $f(x)$ as for $f(cx)$. While this seems to be an interesting property, the only argument
presented in its favor is that the resulting measure depends only on the "shape" of $f$. It
would be interesting to put this on some stronger footing. E.g., the noise is scale invariant,
and so is the step (when considered as a function, though not as a distribution), so there
is a symmetry argument for scale invariance. On the other hand, $\Sigma^2 + \Lambda^2$ or $(\Sigma + \Lambda)^2$
(where $\Sigma$ is redefined to be always nonnegative) also seem like reasonable candidates for
measures to be optimised. Incidentally, one should note that the $A/n_0$ coefficients in the


$$\mathrm{SNR} = \frac{A\,\langle f, \tilde u_{-1} \rangle}{n_0\,\|f\|}$$

where $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ are the $L^2$ inner product and norm, respectively, $\tilde u_{-1}$ is defined by
$\tilde u_{-1}(t) = u_{-1}(-t)$, and $n_0$ is the RMS noise. A noise figure $\Sigma$ for the operator $f$ can then
be defined by

$$\Sigma = \frac{\langle f, \tilde u_{-1} \rangle}{\|f\|}$$

Part of the optimisation, then, is to maximise $\Sigma$.

Localization  criterion 

The localisation is given by the location of the maximum of $f * I$. Canny equates finding
this maximum with solving

$$(f * I)'(x) = 0$$

This amounts to a smoothness assumption on $f * I$, which unfortunately is not explicitly
stated, which makes it unclear what function spaces are involved at various stages. Since
$I = A u_{-1} + n$, this is the same as solving

$$(f * A u_{-1})'(x) + (f * n)'(x) = 0$$

$(f * A u_{-1})' = A f * u_0 = A f$ (where $u_0$ is the unit impulse), and $(f * n)' = f' * n$, so we have

$$A f(x) + (f' * n)(x) = 0$$

I.e., we want to solve

$$A f(x) = -(f' * n)(x)$$

Canny approaches this problem by first observing that $f$ should be an odd function,
then linearizing the problem as follows. Near 0, $f$ can be approximated by its Taylor
expansion, so we can write

$$A f(x) = A\,(f'(0)\,x + \text{higher order terms}) = -(f' * n)(x)$$




of both variational and numerical methods. The operator is extended to a directional
family for 2 dimensions. He uses an adaptive thresholding technique and a noise based
scale selection technique to finally output a very clean set of linked edges.

The  1-dimensional  problem 

The 1-dimensional problem which he poses is this. Assume that the data consists of some
step function in white Gaussian noise, i.e. the data is given by

$$I(t) = A u_{-1}(t) + n(t)$$

where $A$ is a real constant, $u_{-1}(t)$ is the unit step function, and $n(t)$ is the noise process.

Assume further that edge detection proceeds by finding the maxima of $f * I$, for some
convolution kernel $f$. The problem is to find the best $f$ subject to the following
performance criteria:

1)  Good  detection:  low  false  negative,  false  positive.  Equivalent  to  maximising  S/N 
(signal  to  noise  ratio). 

2)  Good  localisation. 

3)  1  edge  yields  only  1  response. 
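
The setup can be simulated directly. This is a minimal sketch with choices of our own that are not taken from Canny: a discrete signal of 401 samples, a sampled 1st derivative of a Gaussian as the kernel $f$, a fixed random seed, and hypothetical function names. The detector reports the position of the largest response.

```python
import numpy as np

def derivative_of_gaussian_kernel(s, half_width):
    """Odd kernel sampled at the integers -half_width..half_width."""
    k = np.arange(-half_width, half_width + 1, dtype=float)
    return -k / s**2 * np.exp(-k**2 / (2 * s**2))

def detect_step(signal, kernel):
    """Convolve and return the index of the largest response."""
    response = np.convolve(signal, kernel, mode="same")
    return int(np.argmax(response)), response

rng = np.random.default_rng(0)
t = np.arange(401)
edge_at, amplitude, n0 = 200, 1.0, 0.1
data = amplitude * (t >= edge_at) + rng.normal(0.0, n0, size=t.size)   # I = A u_{-1} + n
location, response = detect_step(data, derivative_of_gaussian_kernel(5.0, 30))
```

With this sign convention an upward step gives a positive peak. The spurious downward step that zero padding creates at the right border gives a negative response, so it does not disturb the maximum.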

Sensitivity  criterion 

The  signal  to  noise  ratio  is  given  by 

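\[ \mathrm{SNR} = \frac{\text{signal response}}{\text{RMS noise response}} = \frac{A \int_{-\infty}^{0} f(x)\,dx}{\left( \int_{-\infty}^{\infty} f^2(x)\,dx \right)^{1/2}} \]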

which we can write more compactly as

[Shanmugam, Dickey, Green 1979]. In any case, as a 2nd derivative operator it is
aesthetically pleasing because of smoothness and the Fourier domain symmetry (i.e. the
Gaussian is an eigenfunction, or Gaussians in general are an invariant subspace of the
Fourier transform). Zero crossings are a useful way to locate edges, but none of the
“mathematical” or heuristic arguments presented about them here are convincing.

One must regard the assumptions and techniques of this work as tentative and experimental,
rather than as a well founded theoretical or practical system. The ideas are based
on intuition, perhaps good intuition, but lacking better justification must be regarded as
only intuitive. The professed purpose is an explication of human vision. Unfortunately,
so little is known about human vision (e.g. there is no viable theory of how any but
the most rudimentary information is coded or utilized), that one cannot draw any
conclusions about the validity of any theory purporting to explain human vision, and in
any case it is not our purpose to do so here. For example, it is clear that there are
on-center off-surround receptive fields with a response qualitatively like the DOG. But one
can approximate the same data with polynomials, Bessel functions, or your own favorite.
The important thing is the qualitative feature of smoothly varying on-center off-surround
response. The DOG may be computationally convenient, which is reason enough for its
use, but the type of convenience does not translate into cost for a living system, without
further analysis. It is fair to say that the theory presented here is not obviously ruled
out, but neither is it clearly the best or only possibility. As far as its being a theory of
visual information or an engineering design, one can only say that it is an interesting and
provoking hypothesis, but not inexorable or proven.

Canny 1983

“Finding Edges and Lines in Images”

In this work, Canny begins by posing a local 1-dimensional edge detection problem as an
optimization problem over the set of convolution operators, which he solves by the use

$[\partial^2 f/\partial x_i \partial x_j]$. Note that $D_v^2 f = D_v(D_v f)$, where $D_v$ is the directional derivative in the v
direction). Unfortunately, they seem to be mainly thinking about the special case where
$\partial f/\partial y = 0$; they develop some strange ideas about the conditions and ramifications of
their maximum slope concern. They state and claim to prove a theorem to the effect
that the condition (confusingly stated) obtains if and only if $\partial f/\partial y$ = constant. They
are somewhat careless in the proof of the if part, failing to explicitly consider the slope.
In fact, what they are trying to show, in their notation, is that $\cos^3\theta \cdot f_{xxx}$ attains a
strict maximum for $\theta = 0$. This will be true if f is 3 times differentiable and $f_{xxx} \neq 0$.
However, the authors have only assumed that $f \in C^2$, and they neglect the possibility
that $f_{xxx}$ may vanish. The only if part and its “proof” are omitted from the Proc. R.
Soc. version of the paper, and wisely so, for they are erroneous; the purported proof
shows only that $D_v^2 f$ may not be 0 for $v \neq x$. See [Canny 1983] for a more coherent use
of 2nd derivative, and the review of [Canny 1983] for more discussion.

The authors assume that coincident zero crossings from a set of contiguous channels
imply a real edge and conversely. This so-called spatial coincidence assumption is not
well-supported by any argument. (E.g. see [Canny 1983] for pictorial counterexamples.)
The only situation for which it really makes sense is that of a very sharp edge between
fairly large constant areas. Otherwise, it seems perfectly reasonable to believe that the
edge will be visible at only 1 scale, while smaller scales will have inadequate sensitivity,
i.e. their zero crossings will be essentially random, and larger scales may include other
features, so their zero crossings will depend (arbitrarily) on those features as well.

Evaluation 

The paper presents no convincing arguments that the $\nabla^2 G$ or DOG operator is optimal
or otherwise privileged in this context. However, for the purpose of step edge detection
a particular kind of optimality under some conditions has been shown elsewhere

matched, yielding a simple result in the case of a step edge. In that case, the approximate
filter has an impulse response which is the 2nd derivative of a Gaussian (see review of
[Shanmugam, Dickey, Green 1979] for details). For the ranges that the approximations
are valid (see review of [Shanmugam, Dickey, Green 1979]), this vindicates the use of the
Laplacian of the Gaussian by Marr and Hildreth, but only for the specific type of matched
filtering of step edges studied by [Shanmugam, Dickey, Green 1979], though [Marr and
Hildreth 1979] makes no mention of the type of analysis in [Shanmugam, Dickey, Green
1979], basing the use of the Gaussian on the more nebulous grounds mentioned above.

The authors are interested in finding points of maximum directional derivative as edge
locations, and they choose to locate these as zero crossings of a 2nd derivative. Based
on cost considerations they opt for an isotropic 2nd derivative operator, the Laplacian
$\nabla^2$ (the only such), and wish to compute $\nabla^2(G * I)$, where G is the Gaussian. Since
$\nabla^2(G * I) = (\nabla^2 G) * I$, they want to convolve with $\nabla^2 G$, which they approximate as a
difference of Gaussians (DOG).
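
Both steps of this argument can be checked in a 1-dimensional analogue. The sketch below is illustrative only: the width ratio of 1.6 between the two Gaussians, the sampling grid, and all function names are assumptions, and the 2nd derivative G'' stands in for $\nabla^2 G$. It measures how closely a difference of Gaussians follows G'', and confirms that differencing the smoothed signal equals smoothing with the differenced kernel.

```python
import math


def gauss(x, sigma):
    """Unit-area Gaussian of standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))


def gauss_second_derivative(x, sigma):
    """G''(x), the 1-dimensional counterpart of the Laplacian of a Gaussian."""
    return (x * x / sigma**4 - 1.0 / sigma**2) * gauss(x, sigma)


def dog(x, sigma, ratio=1.6):
    """Wide Gaussian minus narrow Gaussian; sigma is the geometric mean of the widths."""
    return gauss(x, sigma * math.sqrt(ratio)) - gauss(x, sigma / math.sqrt(ratio))


def correlation(u, v):
    """Normalized inner product of two sampled functions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def convolve_full(a, b):
    """Plain discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


# Shape agreement between G'' and the DOG on a fine grid.
xs = [i * 0.05 for i in range(-200, 201)]
shape_match = correlation([gauss_second_derivative(x, 1.0) for x in xs],
                          [dog(x, 1.0) for x in xs])

# Linearity: second difference of (G * I) equals (second difference of G) * I.
second_difference = [1.0, -2.0, 1.0]
g = [gauss(k, 2.0) for k in range(-8, 9)]
image_row = [0.0] * 20 + [1.0] * 20
lhs = convolve_full(second_difference, convolve_full(g, image_row))
rhs = convolve_full(convolve_full(second_difference, g), image_row)
max_gap = max(abs(p - q) for p, q in zip(lhs, rhs))

print(shape_match, max_gap)
```

The correlation is a shape measure only; the DOG and G'' differ by an overall scale factor, which does not move zero crossings.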

Logan’s theorem (reconstructibility of analytic 1 octave bandpass signals from their zero
crossings) is invoked to help justify use of zero crossings. However, the theorem is
applicable only for 1 dimension, and the signals involved here have a bandpass of nearly 2
octaves. An argument is made that slope information may be adequate to bridge the gap
(in analogy to the situation for the sampling theorem). On the other hand, there is no
reason why reconstructibility should be a criterion, since there is never any requirement
that an image understanding system should be able to reconstruct the input signal.

It appears that the authors are concerned that the zero crossing direction be perpendicular
to the direction of “maximum slope of the directional derivative.” Apparently, what
this is supposed to mean is that $\nabla D_v^2 f$ should be collinear with v (where $D_v^2$ is the 2nd
directional derivative in the v direction, the second derivative of a section of f taken along
a line in the v direction, which can be written as $v^T H v$, where H is the Hessian matrix

lead to the frequency domain. If one regards “scale” as referring to rate of change, then
normalising a bandlimited function bounds the derivative, but the converse need not
be true. Thus, bandlimiting can be regarded as one way to limit scale in some sense.
However, no arguments are presented to bolster the desire to consider the frequency
domain. The reason that frequency domain methods work in engineering is the fact that
exponentials are eigenfunctions of linear translation invariant operators, so one can use
superposition to combine the effects of various bandpasses. Related is the convenient
fact that convolutions are mapped to multiplications. The work under consideration
uses exclusively linear methods, but does not present such an argument. On the other
hand, if one uses nonlinear methods, there is no such justification. (See the review of
[Shanmugam, Dickey, Green 1979] for another argument supporting bandlimiting.)
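
The eigenfunction property invoked above is easy to verify in a discrete, periodic setting. The following sketch assumes circular convolution on 16 samples and an arbitrary small smoothing kernel (both hypothetical choices): convolving a complex exponential with the kernel returns the same exponential scaled by one complex number, which is the kernel's Fourier coefficient at that frequency.

```python
import cmath
import math


def circular_convolve(h, x):
    """Translation-invariant linear operator on periodic sequences of equal length."""
    n = len(x)
    return [sum(h[m] * x[(i - m) % n] for m in range(n)) for i in range(n)]


def dft_coefficient(h, k):
    """Fourier coefficient of h at frequency index k."""
    n = len(h)
    return sum(h[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))


N = 16
k = 3
h = [0.5, 0.25] + [0.0] * (N - 3) + [0.25]  # symmetric smoothing kernel (illustrative)
exponential = [cmath.exp(2j * math.pi * k * i / N) for i in range(N)]

output = circular_convolve(h, exponential)
eigenvalue = dft_coefficient(h, k)
residual = max(abs(output[i] - eigenvalue * exponential[i]) for i in range(N))

print(eigenvalue, residual)  # residual is at rounding-error level
```

Nothing comparable holds once a nonlinear step such as thresholding is inserted, which is the point made in the paragraph above.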

The authors argue further that the conflicting requirements of space- and band-limiting
are optimally reconciled by minimizing the space-bandwidth product. For the appropriate
definition of these terms, it is well known that the Gaussian ($e^{-kx^2}$, for the right k)
achieves the minimum, so the authors conclude that the filters they want are Gaussian.
Unfortunately, even if one accepts the doctrine of band-limiting, it is by no means clear
that the Gaussian is optimal. In the first place, the Gaussian is neither strictly
band-limited nor strictly space-limited. When one, say, bandlimits by truncation, it is no
longer optimal. If one requires a strictly band-limited or space-limited function, i.e.
one which is 0 outside of a given interval in either the spatial or frequency domain,
the work of [Slepian and Pollak 1961, Landau and Pollak 1961, Landau and Pollak
1962] and [Shanmugam, Dickey, Green 1979] shows that the optimal filter has a transfer
function which is essentially a prolate spheroidal wave function divided by the transform
of the waveform to be matched, where optimality is defined in terms of concentrating
energy in a spatial interval, rather than minimizing space-bandwidth product. Under
some conditions, the prolate spheroidal wave functions can be approximated by functions
related to the Gaussian. However, the optimal filter still depends on the function to be

be strictly space-limited. This is the case if one convolves a mask with the image. If
one were interested in concentrating energy in some band, then the problem would be
the dual of the one considered in the reviewed paper, viz. to find the optimal space-
(or time-) limited filter for concentrating its energy in some frequency band. With the
duality of the Fourier transform, the solution is essentially the same. Now an argument
can be made for considering the frequency domain based on noise considerations, for as
the authors show, the signal to noise ratio is a function of the space-bandwidth product.
Since white Gaussian noise has constant spectral power density, the frequency domain is
a natural setting for its analysis. Unfortunately, for good present-day images, the true
noise is of the same order as the digitization noise, and most of the “noise” really comes
from real variations in the image, i.e. from the fact that the image is not in the space
of ideal features. It is not clear whether this type of “noise” can properly be regarded as
white and Gaussian; e.g., it is not perfectly uncorrelated.

On the other hand, it would indeed be satisfying to learn that bandlimiting is required
for some strong inherent reason, so the prolate spheroidal wave functions are worth
experimenting with, and should at least be kept in mind.

Marr  and  Hildreth  1979 
“Theory of Edge Detection”

The authors are concerned with finding a smoothing filter which will analyze the visual
input into a number of channels related to physical scale.

They argue that such a filter should operate over a subrange of scales, not over all scales
possible in the image. Furthermore, it should be spatially localized. From this they
infer the (contradictory) requirements that the filter be both band-limited and
space-limited. Although space limiting clearly follows from the localization requirement, the
band-limiting conclusion is on shaky ground, since the idea of “scale” does not inexorably

where $\Omega$ is the bandwidth (i.e. the signal is nonzero only when $\omega \in (-\Omega, \Omega)$), and the
energy is to be concentrated in the spatial interval $(-I, I)$ (note this is a slightly different
use of I than in [Shanmugam, Dickey, Green 1979, Lunscher 1983]). These conditions
say that the approximation is valid for large space-bandwidth products, and under those
conditions it is valid only away from the band limits. If those conditions are violated,
e.g. by requiring better localization, then the prolate spheroidal wave function which is
the solution no longer looks like a Gaussian. This is similar to the localization results
found by [Canny 1983].

Blurred edges are modelled as the difference of exponentials to obtain a symmetrical
sigmoid function (only once continuously differentiable, though). They show that if the
resolution interval I is larger than the blur width (defined by the 90% points), then the
filter is still a good approximation to optimal in an appropriate sense.

A Gaussian noise analysis is also presented, showing that S/N improves with increasing
space-bandwidth product, e.g. coarser resolution, not a very surprising result in view of
many others to the same effect. An expression for S/N is given.

The experimental results are not very impressive when compared to nonlinear edge
detectors (e.g. after thresholding), but they show a clear improvement over other standard
linear filters, e.g. high pass, Laplacian.

Evaluation

This is not a direct method of detecting edges, but rather should be regarded either as an
enhancement method, or, more importantly, as a precise approach that could be taken
in finding an optimum filter to reconcile space- and band-limiting.

If one must do computations in the frequency domain, then the filter used must be strictly
band-limited. But there is no persuasive argument for using the frequency domain. If
one does computations in the spatial (or time) domain, then of course the filter must

where $K_1$, $K_2$ are simple functions of $\Omega$, $I$. When the ideal input is a step edge, this
reduces to

\[ H(\omega) = K_1 \omega^2 e^{-K_2 \omega^2} \]

The authors allude to work by Streifer [Streifer 1965b] showing that “the error is not
prohibitive even when Slepian’s constraints are violated.” [Lunscher 1983] has pointed out
a dimensional error in the exponent above, and uses asymptotic expansions of [Streifer
1965a, Streifer 1965b] to arrive at a $K_2$ of the correct dimensions to assure a
scale-invariant response.

The  optimality  of  Gaussians 

What does this say about the optimality of Gaussians? Since Gaussians minimize the
space-bandwidth product for functions of infinite (frequency) extent, one would expect
that the imposition of a finite extent constraint would lead to a result which approached
a Gaussian asymptotically. The question then becomes whether the conditions for the
asymptotic approximation are applicable in a particular situation. For example, if one
starts with a Gaussian which is approximately band-limited, say 99% of its energy is
within $(-\Omega, \Omega)$, then that Gaussian has a particular spatial extent, too, parametrized by
its standard deviation, so 99% of its energy is in the spatial interval $(-I, I)$, where I is the
appropriate multiple of $\sigma$. Now if we are demanding that the function we are interested
in must concentrate its energy in the interval $(-0.01 \cdot I, 0.01 \cdot I)$, then clearly the Gaussian
will not be a very good approximation.
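
The coupling between the two extents of a Gaussian can be made concrete. This is a small sketch under stated assumptions: energy is taken to mean the integral of the squared amplitude, frequency is angular, and the function names are hypothetical. For a Gaussian amplitude of standard deviation sigma it returns the half-widths I and Omega that each contain 99% of the energy.

```python
import math
from statistics import NormalDist


def energy_half_width(amplitude_sigma, fraction=0.99):
    """Half-width holding `fraction` of the energy of exp(-x^2 / (2 s^2)).

    The energy density is the squared amplitude, which is again Gaussian,
    with standard deviation s / sqrt(2).
    """
    z = NormalDist().inv_cdf(0.5 + fraction / 2.0)
    return z * amplitude_sigma / math.sqrt(2.0)


def space_and_band_extent(sigma, fraction=0.99):
    """Return (I, Omega) for a Gaussian of spatial standard deviation sigma."""
    spatial = energy_half_width(sigma, fraction)
    # The transform of exp(-x^2 / (2 sigma^2)) is proportional to
    # exp(-sigma^2 w^2 / 2), a Gaussian in w of standard deviation 1 / sigma.
    band = energy_half_width(1.0 / sigma, fraction)
    return spatial, band


for sigma in (0.5, 1.0, 4.0):
    extent, bandwidth = space_and_band_extent(sigma)
    print(sigma, extent, bandwidth, extent * bandwidth)
```

The product of the two half-widths comes out the same for every sigma, so fixing the band limit fixes the spatial extent. A demand for 100 times tighter spatial concentration at the same bandwidth therefore cannot be met by any Gaussian.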

For the scale-invariant version of the prolate spheroidal approximation due to [Lunscher
1983], the domain of validity is defined by

Following [Slepian and Pollak 1961, Landau and Pollak 1961, Landau and Pollak 1962]
they decompose in terms of prolate spheroidal wave functions, and show that the optimal
filter output is the order 1 prolate spheroidal wave function, with the space-bandwidth
parameter dependent on the space and bandwidth cutoffs required. This method of
analysis allows the bandwidth and space cutoff to be chosen independently, unlike the
situation with a Gaussian. This constitutes a more realistic treatment of the type of
optimality sought in [Marr and Hildreth 1979], yielding functions other than Gaussians,
although under certain ranges of parameters the Gaussian is a good approximation.
Specifically, the transfer function of the optimal filter is given by


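\[ H(\omega) = \begin{cases} K\,\psi_1 / F(\omega), & |\omega| < \Omega \\ 0, & |\omega| > \Omega \end{cases} \]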


where K is a real constant, $\psi_1$ is the 1st order prolate spheroidal wave function, $\Omega$ is the
half bandwidth (i.e. the signal is nonzero only when $\omega \in (-\Omega, \Omega)$), and the energy is to be
concentrated in the spatial interval $(-I, I)$ (note this is a slightly different use of I than
in [Shanmugam, Dickey, Green 1979, Lunscher 1983]), and $F(\omega)$ is the Fourier transform
of the ideal input. The only information used about the input and filter to derive this
formula is the fact that they are odd and even functions, resp. There is no particular
justification for requiring the filter to be even (it gives a neater result) except that it allows
ready generalization to a rotationally invariant 2-dimensional operator simply by making
the value depend only on distance from the origin, i.e. by rotating the 1-dimensional
operator. [Canny 1983] regards this as a rather unfortunate assumption, since in his
analysis directional operators provide better sensitivity, and he shows that dropping the
assumption leads to an operator very much like the one he proposes.


Using an approximation of Slepian [Slepian 1965], the optimal filter within the bandpass
is approximated as

\[ H(\omega) \approx \frac{K_1\, \omega\, e^{-K_2 \omega^2}}{F(\omega)} \]

least squares fits (which are being proposed) are notorious for being badly behaved: they
tend to have extra wiggles. One would expect that such functions would not be very
good ones to use if one wanted to look at derivatives. One might prefer to use Fourier
interpolation, B-splines, Fourier splines, or some other appropriately well-behaved set
of functions. No mention is made of his previous idea of looking at discontinuities of
parameters of fit between adjacent regions. Nevertheless, some kind of fitting process
seems to be in order to use global information for local features (in this case the global fit
yields the local derivative). The noise performance issue is postponed in [Haralick 1981],
but treated thoroughly in [Haralick 1984].

These “edge” detectors are a beginning based on surface fitting. The particular predicates
involved are not adequate, though, and therefore cannot be expected to give outstanding
performance (see [Canny 1983] for one discussion of performance). Improvements can be
expected when the type of qualitative information used in [Haralick, Watson, Laffey 1983]
is brought to bear on finding edges.

Optimal Filters

Shanmugam, Dickey, Green 1979

“An Optimal Frequency Domain Filter for Edge Detection in Digital Images”

The authors consider the 1-dimensional edge detection problem, with the proviso that
“symmetries appropriate to the 2-dimensional problem are retained.” Their goal is to
obtain a frequency domain filter to concentrate maximal energy near an edge. The model
for an input edge is the unit step.

More particularly, the authors require a strictly bandlimited filter (i.e. a filter whose
Fourier transform has its support on an interval surrounding the origin), and they seek
to maximize the power in some interval around the origin in the space domain for the
filter output response to a unit step.

Gaussian approximation has a performance measure $\Sigma\Lambda$ which is about 20% worse than
the optimum operator.

The  2-dimensional  problem 

Canny does not consider the 2-dimensional optimization problem de novo, but rather
starts from the point he reached with the 1-dimensional problem, which is the
derivative-of-Gaussian operator. The approach is to use an operator of the form $h(x,y) = f(x) \cdot g(y)$,
for various orientations of the orthogonal coordinates x, y. Then f is to be the
(approximate) optimal 1-dimensional operator, and g must be determined. By reasoning
similar to that involved in finding f, he notes that g should be smooth, i.e. a smooth
window function, and he notes that the Gaussian he chooses is a good approximation to
standard windowing functions. First, the edge orientation is estimated from the gradient
of the smoothed image, i.e. from

\[ \nabla(G * I) \]

where G is a rotationally symmetric Gaussian. Then the location of the edge is determined
by finding the zero crossings of an operator which computes $D_v^2 G * I$, the 2nd directional
derivative in the v direction, where v is approximately given from the gradient estimate.
This is what it would seem [Marr and Hildreth 1979] were really after. Notice that this
can be realized as a single operator (i.e. it is not a directional family) since one seeks the
zeroes of

\[ D_{\nabla S}^2 S \]

where $S = G * I$.

A compact description of the 2nd derivative operator is as follows. The 2nd derivative for
a function f of 2 variables can be thought of as a matrix, known as the Hessian matrix,
given by $H = [\partial^2 f/\partial x_i \partial x_j]$. The 2nd directional derivative in the v direction is then

\[ D_{\nabla f}^2 f = (f_x, f_y) \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix} \begin{pmatrix} f_x \\ f_y \end{pmatrix} \]

which expands to

\[ D_{\nabla f}^2 f = f_x^2 f_{xx} + 2 f_x f_y f_{xy} + f_y^2 f_{yy} \]

Normalizing by $|\nabla f|$ does not alter the zeroes of this quantity.
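
As a quick check of the expansion, the quadratic form and its expanded version can be evaluated side by side. The sketch below uses hypothetical function names and a hand-differentiated test function, f = x^2 y at the point (1, 2); it is a sanity check of the algebra, not part of Canny's operator.

```python
def second_derivative_along_gradient(fx, fy, fxx, fxy, fyy):
    """Quadratic form g^T H g with g = (fx, fy) and H the Hessian matrix."""
    hg_x = fxx * fx + fxy * fy
    hg_y = fxy * fx + fyy * fy
    return fx * hg_x + fy * hg_y


def expanded_form(fx, fy, fxx, fxy, fyy):
    """fx^2 fxx + 2 fx fy fxy + fy^2 fyy, the expansion quoted in the text."""
    return fx * fx * fxx + 2.0 * fx * fy * fxy + fy * fy * fyy


# f(x, y) = x^2 y at (1, 2): fx = 2xy, fy = x^2, fxx = 2y, fxy = 2x, fyy = 0.
derivs = dict(fx=4.0, fy=1.0, fxx=4.0, fxy=2.0, fyy=0.0)

print(second_derivative_along_gradient(**derivs), expanded_form(**derivs))
```

Dividing either value by fx^2 + fy^2 gives the 2nd derivative along the unit gradient direction; since that divisor is positive wherever the gradient is nonzero, the zero set is unchanged.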

Canny makes a useful observation in comparing this directional derivative operator to
the Laplacian, $\nabla^2(G * I)$, which is worthwhile repeating here. Consider a coordinate
system with the x-axis aligned with the gradient direction (at the point of interest). In
this coordinate system, the directional derivative has a contribution only from $f_{xx}$, since
$f_y = 0$, since the y direction is orthogonal to the gradient, which is as it should be
for a directional derivative. The Laplacian, on the other hand, also is invariant under
rotation, but does not depend on the gradient, hence it has a contribution from the
2nd derivative in the “uninteresting” y-direction, which leads to nothing but a noise
contribution. Actually, this is a little more subtle than may appear at first glance. If
one assumes an ideal edge as signal embedded in noise, then the signal is completely
constant in the y direction; hence all y derivatives will be 0, so in fact the directional
derivative gives the same answer as the Laplacian (modulo a first order coefficient which
would be normalized away), for the signal response alone. But with noise, the Laplacian
will respond in the y direction, while the directional derivative will not.
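
The difference can be seen on a toy surface. In this sketch the test function tanh(x) + EPS * y^2 is a hypothetical stand-in: the tanh term is a smooth edge at x = 0, and the y^2 term plays the role of unwanted variation along the edge. Derivatives are estimated by central differences; the step size and names are illustrative assumptions.

```python
import math


def derivatives(f, x, y, h=1e-3):
    """Central-difference estimates of fx, fy, fxx, fxy, fyy at (x, y)."""
    fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fx, fy, fxx, fxy, fyy


def directional_second_derivative(f, x, y):
    """2nd derivative along the unit gradient direction."""
    fx, fy, fxx, fxy, fyy = derivatives(f, x, y)
    return (fx * fx * fxx + 2 * fx * fy * fxy + fy * fy * fyy) / (fx * fx + fy * fy)


def laplacian(f, x, y):
    _, _, fxx, _, fyy = derivatives(f, x, y)
    return fxx + fyy


EPS = 0.05


def edge_with_ripple(x, y):
    """Edge at x = 0 plus curvature along the edge (illustrative test surface)."""
    return math.tanh(x) + EPS * y * y


print(directional_second_derivative(edge_with_ripple, 0.0, 0.0))  # vanishes at the edge
print(laplacian(edge_with_ripple, 0.0, 0.0))  # picks up 2 * EPS from the y-direction
```

At the edge the directional operator is zero while the Laplacian is offset by the curvature in y, so the Laplacian's zero crossing is displaced from x = 0.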

From this theme (the second directional derivative of Gaussian convolution), Canny
proceeds to develop a number of variations: multiple widths, “feature synthesis,”
elongated operators, and lateral inhibition.

Multiple widths are required since sensitivity increases with size of support, while
localisation degrades. Canny’s approach is to use the smallest operator with sensitivity adequate
to provide a given error probability. This requires estimating the noise, which he does by
convolving a filter with the edge detector output. Under the assumption that the signal
is an ideal step and the step size is much larger than the noise amplitude (low noise),
he finds that the optimal filter for this is the 2nd derivative of a delta function, and
smoothing the response gives the 2nd derivative of a Gaussian. This he approximates
with a difference of Gaussians, which is less sensitive to the accuracy of the edge location
estimate, with coefficients chosen so as to make it orthogonal to the step response. Of
course measuring noise involves a model of the image, in this case an ideal step, so the
noise measurement is also a measurement of the deviation from the model. Nevertheless,
since it is also a measure of the fit of the model, it is still useful as a confidence measure.

It is a fact of life that images are not composed of ideal step edges. Consequently,
operators of different sizes with adequate S/N centered at the same point may be
responding to different aspects of the image function. The simplest example is a diffuse edge
superimposed on a sharp one (possibly at a different orientation). In general, the single
number that a filter gives at a point does not convey a great deal of information about
the structure of the image in a neighborhood of that point. In particular, the response
of an edge operator based on an assumption of ideal edges gives very little information
about the shape of actual edge candidates. Canny’s approach to this problem, “feature
synthesis,” is reminiscent of the Gram-Schmidt orthogonalization procedure. The idea is
that, starting with the response from the smallest significant operator, he estimates what
the response from the next largest would be if a step edge were responsible. If there is a
large enough disparity with the observed response it is deemed to come from something
else. This has the effect of enlarging the feature space.
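
The prediction step can be illustrated with closed-form responses. This is only a sketch of the idea as described above, not Canny's procedure: it assumes unit-area derivative-of-Gaussian operators, for which the peak response to a step of height A blurred by a Gaussian of width b is A / sqrt(2 pi (sigma^2 + b^2)). All names and numbers are hypothetical.

```python
import math


def peak_response(amplitude, sigma, blur=0.0):
    """Peak response of a derivative-of-Gaussian operator of width sigma
    to a step of the given amplitude blurred by a Gaussian of width blur."""
    return amplitude / math.sqrt(2.0 * math.pi * (sigma**2 + blur**2))


def predicted_larger_response(small_response, sigma_small, sigma_large):
    """Response expected from the larger operator if an ideal step caused small_response."""
    return small_response * sigma_small / sigma_large


def disparity(observed_small, observed_large, sigma_small, sigma_large):
    """Observed minus predicted response of the larger operator."""
    return observed_large - predicted_larger_response(
        observed_small, sigma_small, sigma_large)


S_SMALL, S_LARGE = 1.0, 4.0

# Ideal step: the larger operator sees exactly what the smaller one predicts.
ideal = disparity(peak_response(1.0, S_SMALL), peak_response(1.0, S_LARGE),
                  S_SMALL, S_LARGE)

# Sharp step plus a diffuse edge (blur width 4) at the same location.
small = peak_response(1.0, S_SMALL) + peak_response(1.0, S_SMALL, blur=4.0)
large = peak_response(1.0, S_LARGE) + peak_response(1.0, S_LARGE, blur=4.0)
composite = disparity(small, large, S_SMALL, S_LARGE)

print(ideal, composite)
```

For the ideal step the disparity vanishes; for the composite edge the larger operator responds more strongly than the step hypothesis predicts, which is the signal that something other than a single step is present.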

Elongating  a  maik  along  the  edge  direction  is  another  way  to  increase  the  support,  hence 
the  sensitivity,  but  since  there  is  no  scale  change,  the  localization  improves  (under  the 
ideal  straight  edge  assumption).  Canny’s  elongated  operator  is  essentially  the  sum  of 
Gaussians  taken  along  an  interval,  resulting  in  a  mesa  shape  with  Gaussian  fall-off. 
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A  common  problem  wit  IT  local  operators  that  respond  to  the  gradient  is  that  they  respond 
to  slow  changes  as  well  as  abrupt  ones.  This  can  be  regarded,  again,  as  a  symptom  of  an 
inadequate  feature  space  (the  1-dimcnsional  “edgeness”  number,  essentially  a  projection 
to  a  1-dimensional  space).  One  remedy  for  this  problem  is  to  introduce  a  preprocessing 
step,  lateral  inhibition,  which  sends  olTcnding  subspaccs  to  0.  In  the  context  of  the  Taylor 
expansion,  the  first  olTcnding  subspacc  is  the  constant  term,  but  this  is  already  taken  care 
of  as  long  as  the  operator  has  0  average  value,  e.g.  if  it  is  an  odd  function.  The  next 
problem  is  the  1st  order  term,  which  is  an  example  of  a  “smooth  gradient.”  This  can 
be  removed  by  some  2nd  order  operation,  c.g.  2nd  derivative.  Canny  uses  a  difference 
of  1st  derivative  of  Gaussians  of  different  widths,  weighted  to  send  linear  functions  to 
0,  i.e.  roughly  the  difference  of  adjacent  channels  of  the  optimal  operator.  Like  other 
lateral  inhibition  methods,  this  degrades  the  performance,  in  this  ease. by  about  30%.  It’s 
not  too  hard  to  see  what  the  problem  is  here.  First  of  all,  the  operator  /  was  chosen  to 
maximize 

h  =  J  /•  1i_t  =(/, «_|) 

Without any constraints (and an appropriate measure), this is achieved for $f = u_{-1}$.
With the extra constraints, we can think of it as finding the dual vector of $u_{-1}$, or we
can put everything into the measure, in this case Gaussian measure. Now to compute the
$u_{-1}$-ness of $I$, we compute $(f, I)$, i.e. we apply the distribution $f$ to $I$, or equivalently,
look at the projection on the $u_{-1}$-axis. Unfortunately, it turns out that $(f, t) \neq 0$ (where
$t$ stands for the identity function on the line). I.e. $u_{-1}(t)$ is not orthogonal to $t$. The
idea of sending $t$ to 0, then, is to find some $g$ such that $(g, t) = 0$, but $(g, u_{-1})$ is still as
large as possible. This essentially means making $f$ orthogonal to $t$. Roughly speaking,
one can think of $u_{-1}$ and $t$ as 2 vectors in an inner product space. $f$ is derived from
orthogonal projection onto $u_{-1}$, but since $u_{-1}$ and $t$ are not orthogonal, $t$ has a $u_{-1}$
component. Finding something that will not respond to $t$, i.e. send it to 0, means finding
some new vector $v$ in the subspace orthogonal to $t$. $u_{-1}$ cannot lie in this subspace, since
it is not orthogonal to $t$, so the $v$ component of $u_{-1}$ will be reduced. One way to get
this is to subtract off the $t$ component of $u_{-1}$, a canonical orthogonalization. This would
be the Gram-Schmidt orthogonalization for the pair $t$, $u_{-1}$. In any case, sensitivity is lost. But
consider the problem another way. Consider the subspace spanned by $u_{-1}$, $t$, and instead
of orthogonal projection to a 1-dimensional subspace (the value of a single inner product),
look at the orthogonal projection to this 2-dimensional subspace. I.e., try to find the best
fit of the form

$$a\,u_{-1} + b\,t$$

Then $a$, e.g., can be found by orthogonalizing the basis and renormalizing. I.e. the
problem is just that of writing a vector in a non-orthogonal basis. E.g., for the above
subspace, one gets

$$a = \frac{(I, \hat{u}_{-1}) - (\hat{u}_{-1}, \hat{t})\,(I, \hat{t})}{1 - (\hat{u}_{-1}, \hat{t})^2}$$

where $\hat{g}$ denotes $g$ normalized for the appropriate measure and support interval. This is for
a single support interval, so it cannot be directly compared with Canny’s method, which
uses results from 2 different support intervals, without first specifying what measures,
support intervals, and subspaces one was interested in.
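
The two-dimensional projection just described can be sketched numerically. This is only an illustration under stated assumptions: the inner product is taken to be a discrete sum with a unit-variance Gaussian weight on the interval [-3, 3], and the names `inner`, `step`, `ramp`, and `fit_step_and_ramp` are hypothetical, chosen for this example.

```python
import math

def inner(f, g, ts, sigma=1.0):
    """Discrete inner product of f and g under an (assumed) Gaussian weight."""
    return sum(f(t) * g(t) * math.exp(-t * t / (2.0 * sigma * sigma)) for t in ts)

def step(t):
    """Unit step u_{-1}(t)."""
    return 1.0 if t >= 0 else 0.0

def ramp(t):
    """Identity function t on the line."""
    return t

def fit_step_and_ramp(signal, ts):
    """Return (a, b) of the best fit a*u_{-1} + b*t in the weighted least-squares sense."""
    norm_u = math.sqrt(inner(step, step, ts))
    norm_t = math.sqrt(inner(ramp, ramp, ts))
    c = inner(step, ramp, ts) / (norm_u * norm_t)  # cosine between the two basis vectors
    iu = inner(signal, step, ts) / norm_u          # projection on the normalized step
    it = inner(signal, ramp, ts) / norm_t          # projection on the normalized ramp
    a_hat = (iu - c * it) / (1.0 - c * c)          # coefficient in the non-orthogonal basis
    b_hat = (it - c * iu) / (1.0 - c * c)
    return a_hat / norm_u, b_hat / norm_t          # undo the normalization

ts = [k / 10.0 for k in range(-30, 31)]
# A single projection onto the step responds to a pure ramp (the basis is not orthogonal).
naive_response_to_ramp = inner(ramp, step, ts) / inner(step, step, ts)
# The 2-dimensional fit separates the two contributions.
a, b = fit_step_and_ramp(lambda t: 2.0 * step(t) + 0.5 * t, ts)
```

For a signal lying exactly in the span of the step and the ramp, the fit recovers both coefficients, whereas the single inner product attributes part of the ramp to the step.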

Using optimization methods analogous to those for his edge operators, Canny also finds
optimum detectors for “roofs” and “ridges,” and indeed, this could be done for any
distribution. Extending these operators to 2 dimensions is somewhat trickier than for the
edge. And as the number of operators increases, it becomes harder and harder to make
sense of their outputs, since they are not mutually orthogonal. More than 1 operator, e.g.,
can simultaneously give above threshold output. Furthermore, they may be applied at
different scales and at different orientations. Canny calls this problem of understanding
all these outputs the “integration” problem, and concedes that it is a sticky one. The
problem really stems from the lack of a coherent way to describe the image (locally).
Projecting onto some axes that seem interesting is a start, but such a local projection
yields only a few numbers from which it is hard to derive a picture of the qualitative
behavior of the image. This is a problem of disintegration, or fragmentation. What is
required is a coherent way of describing the qualitative structure of the image, in terms
of the structures which are of interest.

Linking 

Canny uses a fairly effective solution to the “streaking” problem (breaking up of
thresholded edges), which he calls “thresholding with hysteresis.” While this method
does not address the basic problem, namely understanding the image globally and semantically
(in terms of “edges”), it is very workable in the ambient context of ideal edges found by
a local linear operator. The technique outputs contours which are the maximal connected
contours with some part above a high threshold and all parts above a low threshold. This
is equivalent to seeding with strong edges (those above the high threshold) and following
the contour at a lower threshold. This is still not quite the same as detecting a weak
edge due to its length, something commonly beyond the ability of edge detectors. The
source of the problem is that one is attempting to find a global object based on local
measures. Deciding on a continuation of an edge through a region of poor signal to
noise ratio is not a local problem, and it is not clear how to treat it as a signal processing
problem. One could try to look for evidence of long straight edges. Canny does
this to some extent by using elongated operators. But for much longer edges, this is
no longer a local criterion, and special methods are required to deal with the increased
combinatorial load (e.g. a directional operator would have to be applied in very many
directions). Alternately, a method akin to the Hough transform [Hough 1962] could be
used. However, these methods have a bias toward particular shapes of contour; again,
since the locality condition is violated, including additional dimensions of shape is very
costly. There are semantic problems as well. Consider the case of a long straight edge
in a high noise background (or, equivalently, of low contrast). Looking at such data, a
human observer generally sees in fact a straight edge in noise. However, it is clear that if
the edge is long enough and the noise is large enough, there must be places along it where
an estimator based only on local data will meander slightly to get the best estimate. By
the local measure, this will be a better estimate than a long straight edge. That is all
one can expect from a criterion which ignores the global shape of the curve.
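
The seeding interpretation of thresholding with hysteresis given above can be sketched as follows. This is a minimal illustration, not Canny's implementation: it assumes the edge strengths are already available as a 2-dimensional list, uses 8-connectivity, and treats "above" as "greater than or equal to"; the function name `hysteresis` is hypothetical.

```python
from collections import deque

def hysteresis(strength, low, high):
    """Keep pixels >= low that are connected (8-neighbors) to some pixel >= high."""
    rows, cols = len(strength), len(strength[0])
    keep = [[False] * cols for _ in range(rows)]
    # Seed with the strong edge points.
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if strength[r][c] >= high)
    for r, c in queue:
        keep[r][c] = True
    # Follow the contour at the lower threshold.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and not keep[rr][cc] and strength[rr][cc] >= low):
                    keep[rr][cc] = True
                    queue.append((rr, cc))
    return keep

strength = [
    [0, 0, 0, 0, 0],
    [9, 4, 4, 0, 4],
    [0, 0, 0, 0, 0],
]
kept = hysteresis(strength, low=3, high=8)
```

In this small example the weak responses adjacent to the strong seed are retained, while the isolated weak response on the right, which no strong point reaches, is dropped.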

Empirical  results 


The results shown appear quite good in that most of the edges of interest are present
without a preponderance of “noise” edges. Probably, this is mostly due to using local
noise analysis with “hysteresis” thresholding and incorporating the single response
criterion (smoothing). While accurate localization is important for applications requiring
precision, subjective appraisal cannot take this into account very readily, and the main
manifestation of good localization is in localization consistency, i.e. in the estimated
location varying smoothly and monotonically with the actual while maintaining minimal
scatter. This feature enables, e.g., reliable linking, even if the absolute locations are
unreliable. However, here the single response criterion already clears the clutter, so the
localization accuracy is probably not very important in this respect.


To the extent that the results are apparently cleaner than other edge detectors, they are
very good. However, Canny’s results show the same topological problems inherent in all
step-matched filters. These problems are manifested in “wrong” connectivity of contours
(i.e., relative to what a human would draw), and occur in places where the image function
exhibits a local behavior which is different from the class of functions considered in the
design. In Canny’s case, the design functions were constants and ideal straight edges,
possibly augmented by linear functions. This can be expected to fail in busy places,
e.g. corners, resulting in incorrect connectivity, and in fact such behavior is evident in
Canny’s examples.

Evaluation

This work is a significant contribution to the theory of edge detection by finding extrema
of a convolution. Particularly noteworthy are the ideas relating to localization: how to
express it mathematically, and how to incorporate a single response criterion. The latter
leads directly to smoothing, and thus puts the use of smooth convolution kernels on
a firm footing (in computer vision; we are not speaking of statistics in general). The
comparisons with other edge operators, e.g. Laplacian, Hueckel, surface fitting, are quite
useful, as is the discussion of prolate spheroidal wave functions. The ideas have their
germ in the work of Marr and Hildreth, but go much further in the way of development,
sophistication, and rigor, and are certainly creative on their own.

Without detracting from the quality of the work, the subject matter should be put in some
perspective. Each refinement that Canny introduces can be regarded as an enlargement
or refinement of some linear feature space, which is computed pointwise in the image. By
considering sufficiently complicated convolutions, perhaps a great deal can be determined
about the image function, although clearly convolution with image-independent kernels
cannot yield nonlinear functionals. In any case, for convolutions which are essentially
matched filters subject to some constraints, the output of such a filter, at each point,
can be regarded as orthogonal projection onto a 1-dimensional subspace of some function
space (perhaps after some other linear operation, such as differentiation). The purpose
of doing this is to produce a data structure which allows a pointwise decision procedure,
e.g. “is there a zero-crossing here?” This works when one assumes the image comes from
a very special subspace, e.g. ideal steps. Unfortunately, the type of information required
about an image function cannot be adequately compacted to a single number this way.
Consequently, further operators must be used to get more information. Thus the feature
space is enlarged, but it is very difficult to enlarge it in such a way as to make it easy
to solve the decision problem in the feature space. To do this properly, one needs first a
theory which says something about equivalence of images. E.g., certain transformations
should not affect the qualitative interpretation of a piece of the image. Another way to
put this is that convolving with some number of kernels, and using the information only
locally, is the same as projecting some high dimensional space (the pixel values) onto the
space of kernels. This is essentially a standard problem of classification. Experience has
shown that even with nonlinear classifiers, it is very difficult to duplicate the intuitive
distinctions one seeks. The reason for this is that such classification can be incredibly
complex unless it incorporates the structure which captures the distinctions. Projection
onto an “edge” dimension is an attempt at this, but is not enough. What is required
is a more complete understanding of the qualitative shape of the image function, based
perhaps on very nonlinear predicates, e.g. equivalence under some class of transformations
of the support.

“Feature synthesis” and “feature integration” require goodness of fit. So does incorporation
of elongated directional operators. “Non-maximum suppression” can be viewed in the
same way. Removal of the response to slow gradients is also a goodness of fit stratagem,
in that goodness of fit to a linear function is sent to 0. It seems that the original idea
of a single optimal operator has to be modified again and again for different situations.
Why is this? There is really nothing wrong with the operator; the problem is with the
problem that has been posed. One can find an operator that will respond optimally to a
step edge, even for various definitions of optimality and noise process. The difficulty is
twofold. First, the definition of step edge is not entirely realistic. There is no reason to
expect that a natural edge will be well modelled by a step between regions of constant
image intensity (or color). This amounts to a 0th order approximation in the vicinity of
the edge. Things may even be worse than just the smooth fluctuations one might imagine,
e.g. near the limb of an occluding object, one would expect in principle the derivative
normal to the image edge to grow to $\pm\infty$ to a cusp (the intensity would stay finite).
Thus the 0th order term would be a bad approximation in any neighborhood of the edge.
Deviation from constancy near the step will affect at least the localization from a linear
integral operator with nontrivial support. E.g. $(at + b)\,u_{-1}(-t) + (ct + d)\,u_{-1}(t)$, which is 2
ramps of unequal slope separated at the origin by a step, will lead to incorrect localization
for Canny’s operators, including lateral inhibition. One cannot expect any better, as
the feature space includes such a function only in the noise term. Of course, if all such
possible functions conform to the hypothesis of Gaussian white noise, the operator will
still give the best guess, on the average, but that is probably not what one has in mind.
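
As a worked illustration of the two-ramp example above, assume (as a stand-in, not Canny's exact operator) that the edge operator is the first derivative of a Gaussian $G_\sigma$ of scale $\sigma$, and that the edge is marked at the extremum of the response $r$. Here $\Phi$ denotes the standard normal distribution function, $\delta$ the unit impulse, and the step height $d - b$ is assumed nonzero. Then

```latex
\begin{align*}
s(t)  &= (at+b)\,u_{-1}(-t) + (ct+d)\,u_{-1}(t), \\
s'(t) &= a\,u_{-1}(-t) + c\,u_{-1}(t) + (d-b)\,\delta(t), \\
r(x)  &= (G_\sigma' * s)(x) = (G_\sigma * s')(x)
       = (d-b)\,G_\sigma(x) + a + (c-a)\,\Phi(x/\sigma), \\
r'(x) &= G_\sigma(x)\left[(c-a) - \frac{(d-b)\,x}{\sigma^2}\right] = 0
       \quad\Longrightarrow\quad
       x^{*} = \frac{\sigma^2\,(c-a)}{d-b}.
\end{align*}
```

Under these assumptions the marked position is displaced from the true step at the origin by an amount proportional to the slope difference $c - a$ and to $\sigma^2$, and inversely proportional to the step height $d - b$; the displacement vanishes only when the two slopes agree.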

The  second  difficulty  is  that  this  type  of  formulation  answers  the  question  “If  there  is  an 
edge there, how can I best determine its parameters?” But, first one really wants to know
“is there an edge there?” To answer that, though, besides knowing what “edge” means
(not  at  all  a  trivial  matter),  one  must  be  able  to  estimate  how  well  an  edge  hypothesis 
accounts  for  the  data,  compared  to  some  other  hypothesis.  A  distribution  on  one  such 
space  (say  step  edges  plus  white  Gaussian  noise)  does  not  translate  easily  into  another 
distribution  on  what  is  essentially  the  same  space  (say  step  edges  plus  step  ramps  plus 
white  Gaussian  noise).  Again  this  is  the  problem  of  selecting  the  right  feature  space  at 
the  outset. 

Unfortunately, if the step edge is not ideal, e.g. if the values either side of the step are
not constant, localization based on convolution will be inaccurate. This is easy to see
since the extrema of even a linear perturbation $g(x) + ax$ are generically shifted from
those of just $g(x)$. This is the manifestation of an intrinsic problem of convolution with
smooth kernels: the space of signals is much too large to be adequately classified by a
pointwise criterion (such as zero crossing), even for very simple predicates. Restated, it
is difficult to find integral approximations to point properties (unless one is willing to
integrate against delta functions, of course).
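
The shift of an extremum under a linear perturbation is easy to check numerically. The sketch below is only an illustration: $g$ is taken, as an assumption, to be a unit-variance Gaussian bump, the slope is chosen arbitrarily as 0.1, and the search is restricted to a window around the peak because $g(x) + ax$ is unbounded. The helper name `argmax_on_grid` is hypothetical.

```python
import math

def argmax_on_grid(func, lo, hi, step):
    """Location of the largest value of func on a regular grid over [lo, hi]."""
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    return max(xs, key=func)

def bump(x):
    """Assumed smooth response g(x): a Gaussian bump peaking at 0."""
    return math.exp(-x * x / 2.0)

slope = 0.1
peak_clean = argmax_on_grid(bump, -2.0, 2.0, 0.001)
peak_tilted = argmax_on_grid(lambda x: bump(x) + slope * x, -2.0, 2.0, 0.001)
shift = peak_tilted - peak_clean  # nonzero: the peak moves toward the rising side
```

The perturbed peak solves $x\,e^{-x^2/2} = a$, so for a small slope the displacement is roughly equal to the slope itself.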

A  nonlinear  approach 

We claim that the problem can be approached through topological methods. The first
difficulty is due to the lack of any smoothness in the noise process (as usually formulated,
the noise need not be $C^k$ for any $k$). This would make a frontal assault by topological
methods difficult, since the natural settings for such methods are spaces of differentiable
functions. However, we can exploit a certain amount of smoothness which is intrinsic to
the data. I.e., the data already incorporate some “natural” smoothing, whose effects in
any case we would have trouble removing. So we seek to get by with (and exploit) the
“minimal” smoothing, without introducing any more confounding convolution. We can
consider the data as arising from sampling some smoothing process such as bandlimiting,
Gaussian convolution, or even something nonlinear. Since the data space (pixel values) is
finite dimensional (though possibly of high dimension), there is a great deal of collapsing
in the mapping from the infinite dimensional input and model spaces. Therefore, one
should seek a regularity condition on the modelling function (i.e., the smooth function we
assume gave rise to the data) which will guarantee robustness. Put another way, given
an equivalence class of functions yielding the same data, any derived property should be
generic. Such a rigorous criterion of robustness is rarely considered; instead, one simply
assumes (based on good reasons) that, e.g., the model function is bandlimited. In that
case, for a fine enough sample grid, there is a unique (smooth) model function. Since
we are not able to provide a rigorous analysis of robustness here, we just consider what
happens with a smooth model function, assuming it has been chosen to be generic (or
is unique, as with bandlimiting).

How can we remove extra extrema, i.e. do smoothing,
without blurring? [Koenderink and van Doorn 1979] and [Witkin 1983] have proposed
convolving with a 1-parameter family of smoothing operators, and considering how contours
of interest (say zero crossings) change as a function of the parameter. One can do
this in even more generality, as is done in singularity theory. We can consider generic
smooth 1-parameter families of smooth functions, i.e. smooth paths in our model space.
In this way, we are led to the theory of generic bifurcations of critical points, since these
play the major role in determining the topology of the contours. For real functions in the
plane, this theory is very well understood: the only generic changes in the topology of the
level set structure are saddle-node bifurcations of critical points, and saddle-connection
changes in the nesting of saddles. Also, the topology of an individual level set can change
as it passes through a critical point. For functions on the line, things are even simpler:
the only possibility is max-min bifurcation through an inflection point. One can show
that the critical point bifurcations of parametrized Gaussian smoothing are generically
saddle-nodes as well [Blicher and Omohundro 1984], so that the behavior which occurs
in Gaussian scale space is described by the usual generic theory. This opens up the
possibility of doing very non-linear smoothing in a coherent (and simple) fashion. Actually,
since one can now understand the relationships among the extrema, the actual smoothing
is probably unnecessary. The important thing is that the smoothing must proceed by
the annihilation of a saddle with an adjacent node (extremum), or the swapping of saddle
nestings by the saddle connection. This is how any smoothing must work. Exactly in
what order these things happen for a given function depends on the type of smoothing and
properties of the function. Sometimes, this can happen in ways that are not very useful,
since spatial extent and amplitude are interchangeable in a linear (integrating) operator.
It is probably more useful to know things like the heights, supports, and proximities
of critical point domains independently. Then certain kinds might be readily smoothed
while others might not, depending on some interpretation heuristic. E.g. even a very
large spike, if of tiny support, could be removed (by excision, i.e. without affecting
nearby data), or small bumps on a big bump could be regarded as such.
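
For functions on the line, the max-min annihilation described above can be observed directly by counting extrema at several scales. The following is a rough sketch under stated assumptions: the test signal is a sum of two unit-width Gaussian bumps, the smoothing is a sampled Gaussian truncated at 4 standard deviations with zero padding, and scales are given in samples (10 samples per unit). The names `gaussian_smooth` and `count_extrema` are hypothetical.

```python
import math

def gaussian_smooth(samples, sigma_in_samples):
    """Convolve with a sampled, truncated, normalized Gaussian (zero padding)."""
    if sigma_in_samples <= 0:
        return list(samples)
    radius = int(4 * sigma_in_samples)
    kernel = [math.exp(-k * k / (2.0 * sigma_in_samples ** 2))
              for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(samples)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(-radius, radius + 1):
            j = i + k
            if 0 <= j < n:
                acc += kernel[k + radius] * samples[j]
        out.append(acc)
    return out

def count_extrema(samples):
    """Count interior maxima and minima as sign changes of the first differences."""
    signs = []
    for left, right in zip(samples, samples[1:]):
        if right != left:
            signs.append(1 if right > left else -1)
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

xs = [(i - 150) / 10.0 for i in range(301)]
signal = [math.exp(-(x - 1.5) ** 2 / 2.0) + math.exp(-(x + 1.5) ** 2 / 2.0) for x in xs]
# Scales 0, 0.5 and 2 units: the two maxima and the minimum between them
# merge into a single maximum once the smoothing is strong enough.
counts = [count_extrema(gaussian_smooth(signal, s)) for s in (0, 5, 20)]
```

The count can only drop by two at a time here: a maximum and the adjacent minimum disappear together, leaving one maximum at the coarsest scale.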

Global Methods

Accumulator arrays

Hough 1962

“Method and Means for recognizing complex patterns” [Duda and Hart 1971, Duda and
Hart 1972, Duda and Hart 1975]

The Hough technique offers a solution to the problem of finding global straight lines, or
more exactly, finding global sets of nearly collinear feature points. In the present context,
“global” means over the entire image, though other workers have used the same idea for
subregions.

Basic  idea 

Consider $L$, the set of all lines in the plane, as a topological space. Duda and Hart
use the so-called normal parametrization for $L$, where each line is specified by the
pair $(\theta, \rho)$ representing the orientation and distance from the origin of the line. This
parametrization is borrowed from integral geometry, where it is used in the solution of
the Buffon’s needle problem. It derives its utility from providing a translation invariant
measure for the space, so that probabilities behave in desired ways. ([Santalo 1976] is
an excellent source for information about integral geometry, and should be of interest to
vision researchers.) Hough, on the other hand, used the slope-intercept parametrization
familiar from analytic geometry, but which is fraught with difficulties for this situation.
Incidentally $L$ is a non-trivial space: for $\rho > 0$, every value $0 \le \theta < 2\pi$ defines a
different line. But when $\rho = 0$, i.e. for lines through the origin, $(\theta, \rho)$ defines the same
line as $(\theta + \pi, \rho)$. Thus, $L$ is homeomorphic to a semi-infinite circular cylinder with the
bounded end terminated so that antipodal points on the cross-section circle are identified,
which in turn is homeomorphic to a punctured disk with antipodal points on its periphery
identified. This is also the same thing as an infinite Möbius strip, formed by taking an
infinite strip and gluing it together with a half-twist.

The basic insight Hough used is this. For each point $p$ in the plane, there is some curve
$\gamma(p)$ in $L$ which corresponds to all the lines through $p$. For each $p$ of interest in the
picture, accumulate weight for $\gamma(p)$ in $L$. Then lines in the picture will be places in $L$
with high accumulated values. (One can think of this as defining a weight accumulation
function $h : L \to \mathbf{R}$ by $h = \sum_i \chi_{\gamma(p_i)}$, where $\chi_{\gamma(p_i)}$ is the characteristic function of $\gamma(p_i)$.)

As Duda and Hart point out, the method provides a savings because of quantization of
$L$. The finer the quantization, the less the savings.
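
The accumulation just described can be sketched on a quantized version of $L$. This is an illustrative sketch only: it uses the normal parametrization in the chart $0 \le \theta < \pi$ with signed $\rho$ (an assumption made here for convenience, equivalent to the chart with $\rho \ge 0$), 10 degree steps in $\theta$, and unit bins in $\rho$; the function name `hough_lines` is hypothetical.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=18):
    """Accumulate one vote per point in every (theta, rho) cell on its curve gamma(p)."""
    acc = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(k, round(rho))] += 1
    return acc

# Ten collinear feature points on the line x = 5, plus three stray points.
points = [(5, y) for y in range(10)] + [(1, 7), (8, 2), (3, 3)]
acc = hough_lines(points)
(best_k, best_rho), votes = acc.most_common(1)[0]
```

With this coarse quantization the cell for $\theta = 0$, $\rho = 5$ collects all ten collinear points, while the stray points spread their votes over other cells.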

Evaluation

The Hough method is not adaptable beyond very limited spaces of curves because storage
requirements grow exponentially with the number of parameters characterizing the feature,
i.e. with the dimension of the space of curves.

Since the method is totally global, undesired features can come into play, i.e. the noise
level is high due to many chance contributions throughout the image. However, to combat
this problem, one can design localized variations, at the price of requiring a method to
link the local results together.

Possible generalizations include the detection of curves with more parameters, weighted
accumulation (based on confidence or significance of the data points), the inclusion of
[...] considerations, and localization.

The success of the Hough method is very dependent on selecting the initial points of
interest, i.e. on the local feature operator. On the other hand, a good way of doing this
can compensate for large parameter spaces.

Ballard and Sklansky 1976

“A ladder-structured decision tree for recognizing tumors in chest radiographs”

The authors are concerned with finding roughly circular regions of approximately 100
pixels in area.

Summary  of  processing  steps 

A thresholded gradient picture is first arrived at, using a sequence of processes
$\Theta_T \circ \nabla \circ H \circ L$, where

$L$ is the low pass filter operation defined by averaging and then resampling on a coarser
grid. (Note that averaging is not strictly low pass, since the filter transfer function
is a sinc.)

$H$ is a high pass filtering operation performed in the frequency domain via FFTs, using
a filter characteristic attributed to Kruger.

$\nabla$ is a digital gradient operator defined in terms of adjacent pixel differences.

$\Theta_T$ is a global thresholding operator.

A heuristic search connecting edge pixels, similar to Martelli’s technique, is then used to
find the lung area, following a Kelly-like “plan.”

To locate tumors and nodules, a Hough-like method is used: An accumulator array
corresponding to possible circles, indexed by position and radius, is incremented by the
number of edge pixels with positions and gradient directions consistent with lying on the
given circle. An improvement is achieved by using the gradient direction in addition to
the magnitude.
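
A circle accumulator of this kind can be sketched as follows. This is not the authors' code; it is the common gradient-directed variant, under the assumptions that each edge element carries a gradient vector pointing outward from the circle, that centres are quantized to integer cells, and that a small list of candidate radii is given. The name `circle_votes` is hypothetical.

```python
import math
from collections import Counter

def circle_votes(edges, radii):
    """edges: (x, y, gx, gy). Each edge votes for one centre cell per candidate radius."""
    acc = Counter()
    for x, y, gx, gy in edges:
        norm = math.hypot(gx, gy)
        if norm == 0:
            continue  # no usable direction
        ux, uy = gx / norm, gy / norm
        for r in radii:
            # Assumption: the gradient points outward, so the centre lies
            # a distance r from the edge point, against the gradient.
            acc[(round(x - r * ux), round(y - r * uy), r)] += 1
    return acc

# Sixteen synthetic edge elements on a circle of radius 5 centred at (20, 20).
edges = []
for k in range(16):
    ang = 2 * math.pi * k / 16
    edges.append((20 + 5 * math.cos(ang), 20 + 5 * math.sin(ang),
                  math.cos(ang), math.sin(ang)))
acc = circle_votes(edges, radii=range(3, 8))
best_cell, votes = acc.most_common(1)[0]
```

Using the direction means each edge element votes for a single centre per radius rather than for a whole circle of centres, which is where the improvement over magnitude-only accumulation comes from.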

Big and small radii are tumor and nodule candidates, resp. The big ones are immediately
declared to be tumors, while the candidate nodules are subjected to a 2 stage classifier
which looks at features from a detailed nodule boundary finder. The latter is based on
growing all optimal edges of length n in a given region until closure is reached, using a
Kelly-like “plan.”


Evaluation 


The accumulator array method seems to be useful for finding some circle-like boundaries.
One must always keep in mind 2 questions for such feature detectors: what does it really
find, and what will it miss. These questions are best answered either by mathematical
proof or application to numerous examples. Unfortunately, neither of these tests is
available in the present paper, though probably one cannot fault the authors for not
including more examples, since space limitations may have been imposed. In any case,
what is being detected is not regions with roughly circular boundaries, but areas having
a sufficiently high count of above threshold gradient values (of the right orientation) lying
on a circle. This provides some kind of global understanding of the intensity function,
which is commendable, but it is not likely to find sharp edges which do not stay near and
tangent to some such circle. However, the main use in the paper being reviewed is to guide
a more detailed process of boundary finding, and in that context the question becomes
whether the feature being found is indicative of a closed boundary in its vicinity. On
the one hand, there is little doubt that a roughly circular boundary of adequate contrast,
sufficiently defocussed, would cause one or a few such circle detectors to fire, allowing the
more detailed process to find the precise boundary. On the other hand, the firing of a
circle detector is no guarantee that there must be such a boundary: all that is necessary
is that the intensity have a steep enough centripetal gradient over a large enough part of
a circle, which might happen if the intensity function has a maximum inside the circle.

Region growing

Brice and Fennema 1970, Fennema and Brice 1970

“Scene Analysis Using Regions”

This is now a classic work in region growing. Its methods are extremely simple, which a
priori may not be an indictment, but in this case they are based on an overly simplistic
image model that no one now believes. The approach was motivated purely by heuristics,
rather than any theory, and at this level of processing that turns out to be inadequate.

The basic segmentation operation is to partition the image by pixel intensity value. The
authors use boundary predicates which are based on a completely local measure: nearest
neighbor intensity differences.

There  are  2  merging  heuristics: 

Phagocyte  heuristic 

Merge adjacent regions if the “weak” part of their common boundary is a big enough
part of one of their total boundaries. “Weak” and “big enough” are relative to global
thresholds.

Weakness heuristic

Merge adjacent regions if the “weak” part of their common boundary is a big enough
fraction of it (common boundary). Another global threshold is used for “big enough.”

Evaluation

The method presented is much too simplistic. E.g., it will clearly fail if smooth shading
leads to 1st differences of the same magnitude as an edge. Noise spikes will always end
up as regions. The heuristics are too heuristic: they are not based on any analysis or
understanding of real images, beyond a few common-sense notions. Global thresholds are
invariably a bad idea: a little observation can persuade one that the same magnitude (of
edge parameter, gradient, or whatever) can be significant in one context and meaningless
in another.
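
The two merging heuristics described above reduce to simple threshold tests, which makes their dependence on global thresholds plain. The sketch below is a hypothetical rendering, not the authors' code: it assumes a common boundary is given as a list of nearest neighbor intensity differences, that a boundary element is “weak” when its difference is below a threshold, and that region perimeters are measured in boundary elements.

```python
def weak_count(boundary_diffs, weak_threshold):
    """Number of common-boundary elements whose intensity difference is weak."""
    return sum(1 for d in boundary_diffs if abs(d) < weak_threshold)

def weakness_merge(boundary_diffs, weak_threshold, fraction):
    """Weakness heuristic: the weak part is a big enough fraction of the common boundary."""
    return weak_count(boundary_diffs, weak_threshold) >= fraction * len(boundary_diffs)

def phagocyte_merge(boundary_diffs, perimeter_a, perimeter_b, weak_threshold, fraction):
    """Phagocyte heuristic: the weak part is a big enough part of one region's total boundary."""
    weak = weak_count(boundary_diffs, weak_threshold)
    return weak >= fraction * min(perimeter_a, perimeter_b)

mostly_weak = [1, 1, 1, 9]    # three weak elements, one strong
mostly_strong = [9, 9, 1, 9]  # one weak element
```

The same list of differences can pass or fail either test purely through the choice of the two thresholds, which is the objection raised in the evaluation.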

Kirsch 1971

“Computer Determination of the Constituent Structure of Biological Images”

The author indicates that he is interested in image processing as deriving data structures
from image data.

He differentiates between “well-defined objects” and “aggregates,” which is essentially
the difference between smoothly shaded objects with smooth boundaries, and textured
“objects” with texture boundaries. He suggests, among other examples, that cells are
well-defined objects, while tissues are aggregates.

The  goal  is  to  find  boundaries  for  both  types  of  objects,  and  the  approach  is  via  a  local 
contrast  function  which  is  based  on  the  use  of  the  convolution  masks 

     5   5   5          -3   5   5
    -3   0  -3          -3   0   5
    -3  -3  -3          -3  -3  -3

and their 90° rotations. The local contrast function $C$ is then defined as the local
max over all masks of the absolute values from the convolutions. He defines a blob of
heterogeneity $K$ as (our equivalent definition) a connected region $H$ such that $C|_H < K$
and $C|_{\partial H} > K$: basically a low contrast region with a high contrast boundary, with
“low,” “high” defined by the threshold $K$.
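
The contrast function can be sketched for a single 3x3 neighborhood. This is an illustration under stated assumptions: the two base masks are taken as the ones shown above, their 90° rotations give 8 masks in all, and the response is computed as a plain sum of products (the set of masks is closed under 180° rotation, so correlation and convolution give the same maximum). The function names are hypothetical.

```python
def rotate90(mask):
    """Rotate a square mask by 90 degrees clockwise."""
    n = len(mask)
    return [[mask[n - 1 - j][i] for j in range(n)] for i in range(n)]

BASE_MASKS = [
    [[5, 5, 5], [-3, 0, -3], [-3, -3, -3]],
    [[-3, 5, 5], [-3, 0, 5], [-3, -3, -3]],
]

def all_masks():
    """The two base masks together with their 90, 180 and 270 degree rotations."""
    masks = []
    for mask in BASE_MASKS:
        for _ in range(4):
            masks.append(mask)
            mask = rotate90(mask)
    return masks

def local_contrast(patch):
    """Max over all masks of the absolute response on a 3x3 patch."""
    return max(
        abs(sum(m[i][j] * patch[i][j] for i in range(3) for j in range(3)))
        for m in all_masks()
    )

step_patch = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # vertical unit step
flat_patch = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]  # constant intensity
```

Each mask sums to zero, so a constant patch gives zero contrast, while the unit step gives its largest response (15) from the mask whose column of 5s lies on the bright side.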

The data structure he derives is based on the observation that varying the threshold
induces a partial order on the regions by inclusion, which is of course a functorial
consequence of the natural ordering on the thresholds. This partial order he represents
as a tree, and as a reduced tree showing only when regions coalesce.
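
The inclusion order induced by varying the threshold can be sketched in one dimension. This is a simplified illustration, not Kirsch's procedure: it assumes a 1-dimensional contrast profile, takes the regions at threshold K to be the maximal runs where the contrast is below K, and records for each region the region containing it at the next higher threshold. The names `components` and `inclusion_tree` are hypothetical.

```python
def components(values, threshold):
    """Maximal runs of indices with values[i] < threshold, as (start, end) pairs."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v < threshold and start is None:
            start = i
        elif v >= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs

def inclusion_tree(values, thresholds):
    """Map each (threshold, run) to the run containing it at the next higher threshold."""
    levels = [(t, components(values, t)) for t in sorted(thresholds)]
    parent = {}
    for (t_lo, runs_lo), (t_hi, runs_hi) in zip(levels, levels[1:]):
        for a, b in runs_lo:
            for c, d in runs_hi:
                if c <= a and b <= d:
                    parent[(t_lo, (a, b))] = (t_hi, (c, d))
    return parent

contrast = [1, 5, 2, 9, 1, 3, 1]
tree = inclusion_tree(contrast, [2, 4, 6, 10])
```

In this example the runs (4, 4) and (6, 6) at threshold 2 both have the parent (4, 6) at threshold 4, which is a coalescence of the kind the reduced tree would record.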

Evaluation 

The data structure which Kirsch proposes is interesting in that it is essentially the
structure of the level sets of the contrast function he uses. As we point out elsewhere, the
level set structure of a function captures the topologically invariant information. In this
case, however, the preprocessing steps leading to this structure are heuristically based
and unfortunately the invariant features are not adequately studied, and the effects of
noise on the structure are not taken into consideration. The author cannot really be
much faulted for this, as the mathematics involved was not very widely known at the
time.

The result is a technique which is better than just intensity thresholding, but suffers many
of the same drawbacks. Although he keeps track of what may happen for all threshold
values, the thresholds are still global thresholds, although one could generalise slightly
and use thresholds global only to a region. Now although one might expect boundary
contrast to be less variable over some region than simple intensity, it’s easy to imagine,
e.g., a weak spot in a boundary such that lowering the threshold to include the weak spot
introduces enough boundary points to disconnect the region.

Because only the max of the directional contrasts is used, important geometric information
is discarded in finding the boundaries. This is apt to lead to errors for uniform
regions, since noise cannot be rejected on the basis of direction to other boundary points.
The effect for textured regions is hard to evaluate; in some cases it may be helpful, but
it seems unlikely to work alone.

No justification is given for the values in the convolution mask. For the purpose of
detecting a step edge in the presence of Gaussian noise, it is not the most sensitive.
Not enough experimental data is presented to give any feel for performance on real images.

Somerville and Mundy 1976

"One Pass Contouring of Images Through Planar Approximation"

The  authors  state  that  they  are  interested  in  finding  contours  of  the  intensity  function, 
but  what  they  mean  by  this  is  finding  places  of  large  change  of  gradient. 

Their  first  goal  is  to  represent  the  picture  data  compactly  for  further  processing.  This  is  a 
good  idea,  since  it  is  necessary  to  have  a  representation  of  the  intensity  surface  for  varying 
neighborhoods, not just single pixels. An important reason for doing so, which they do
not  mention,  is  to  synthesize  semi-global  but  accurate  information.  (By  semi-global,  we 
mean  regions  larger  than  a  single  pixel  or  pair  of  pixels,  yet  smaller,  usually  much  smaller, 
than  the  entire  image.) 

The primitive regions to be used for region growing are triangles. These are initially
formed by drawing diagonals for each set of 4 points so as to keep similar intensities
together.  The  region  growing  is  then  done  by  a  process  of  raster  scan  local  merging.  The 
merging  criterion  is  as  follows. 

1) Compute the normal vector of the intensity function on the next triangle. This is
not normalized; it is actually the 3-dimensional gradient.

2)  Compare  with  the  current  average  normal  vector  for  each  whole  adjacent  region. 

3) Merge the triangle into the region if the magnitude of the vector error ($|n_r - n_t|$)
is less than a threshold based on region size:

$$\text{threshold} = k_1 e^{-k_2 A} + k_3$$

This can be criticized as follows.

2)  Presumably,  they  do  this  because  they  want  regions  of  uniform  normal.  But  it  seems 
more  reasonable  to  compare  normals  locally,  leading  to  locally  uniform  normal,  i.e. 
regions  of  slowly  changing  normal. 

3) The adaptive threshold is not well justified. The stated purpose is noise immunity;
presumably, values for large regions should be more stable, so there are 2 terms, one
for the region noise, one for the triangle noise, though this is not explicitly stated.
Since the gradient is a linear operator, one could in fact explicitly solve for the noise
characteristics of the expected difference in normals. The region component would
be of the form $\sigma = k\sigma_0/\sqrt{A}$, and in fact the standard deviation of error in normals
is given by

$$\sigma_0 \sqrt{k^2/A + c^2}$$

where $k^2$ is the mean square contribution of each pixel in the region to the expression
for the normal, and $c^2$ is the analogous quantity for a single triangle. In this
light, the threshold adopted by the authors is seen to be a linearization and exponential
approximation to this function, for a fixed standard deviation of image noise.
Furthermore, since the merging is done on a raster scan, the merging predicate will
result in different behavior near the tops of regions as compared to the bottoms. Not
only that, but this can happen in a discontinuous way, when 2 regions suddenly get
merged.

The  entire  process  is  equivalent  to  edge  detection  based  on  computing  a  gradient  from 
x  and  y  first  differences.  However,  the  edge  predicate  is  adaptive  in  the  sense  that  the 
threshold  is  based  on  the  mean  gradient  of  adjacent  regions  (in  this  case,  only  of  regions 
above,  i.e.  earlier  in  scanning).  The  adaptive  part  isn’t  a  bad  idea,  but  using  an  operator 
with  a  support  of  3  will  lead  to  noise  problems,  as  well  as  problems  with  discerning  larger 
scale  features. 
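
The equivalence with first differences can be checked directly. The sketch below assumes a triangle whose vertices are the pixels (0,0), (1,0), (0,1) with intensities `i00`, `i10`, `i01` (hypothetical names); the unnormalized normal of the plane through the three points is the cross product of two edge vectors.

```python
def triangle_normal(i00, i10, i01):
    """Unnormalized normal of the plane through
    (0, 0, i00), (1, 0, i10) and (0, 1, i01)."""
    ux, uy, uz = 1.0, 0.0, i10 - i00   # edge vector along x
    vx, vy, vz = 0.0, 1.0, i01 - i00   # edge vector along y
    # Cross product u x v.
    return (uy * vz - uz * vy,
            uz * vx - ux * vz,
            ux * vy - uy * vx)

# The first two components are minus the x and y first differences.
normal = triangle_normal(10.0, 13.0, 8.0)
```

For intensities 10, 13, 8 the x and y first differences are 3 and -2, and the normal comes out as (-3, 2, 1), so comparing normals amounts to comparing gradients estimated from a 3-pixel support.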
Experimental results

A single example on an 84 × 48 × 8 picture is given. A reconstruction of the original is
presented, based on linear interpolation about the centroid of each region. This result is
not impressive. The authors are concerned with data compression and reconstructibility,
but from the point of view of image understanding, reconstructibility should not be seen
as a measure of performance. The region boundaries displayed do not appear significantly
better than other, local, methods. It would be interesting to see the results of a process
incorporating the improvements suggested above, viz.

•  gradients  computed  for  larger  neighborhoods 

•  thresholds  based  on  picture  noise  and  exact  formulas 

•  merging based on local information. Alternately, one could iterate taking gradients.

•  some  isotropic  merging  process  (which  might  result  in  the  requirement  for  more  than 


Even so, the gradient idea leads to difficulties if an edge should pass through the operator
support: one might get many regions perpendicular to the edge, elongated along the
edge, but broken up as the geometry of the edge changes; in other words, poor behavior.
A  plane  is  too  simple  a  model  for  the  local  intensity  surface. 

Histogramming

Ohlander 1975
"Analysis of Natural Scenes"

The author does region growing based on analysing histograms of 9 color image
parameters: the 3 raw R, G, B values, as well as the derived parameters of intensity,
hue, saturation and the Y, I, Q parameters used in color signal coding techniques. In
addition, values of their gradient as found by a Sobel operator are used, as is the local
density of points above threshold in the gradient picture, called the "business matrix." He
performs shrinking and expansion on the business matrix to eliminate thin regions (i.e.,
non-texture edges). The histogram analysis is based on a simple heuristic, and sometimes
is done with manual intervention. Regions are found by thresholding.

Evaluation 

The  technique  of  thresholding  based  on  features  of  histograms  ignores  any  geometric 
relations in the data (a random permutation of the position of pixels doesn't change the
histogram). Similarly, it takes no account of the photometric properties of the real world.
These problems aside, the use of 9 1-dimensional histograms is still somewhat naive,
since the pixel space is only 3-dimensional. It would be more systematic to use clustering
techniques in some 3-dimensional color space (which have an extensive literature) instead
of 9 somewhat arbitrary 1-dimensional projections.
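
The remark about permutations is easy to demonstrate. The following is a small sketch on a made-up two-region image (the pixel values are illustrative only): scrambling pixel positions destroys the regions but leaves the histogram unchanged.

```python
import random
from collections import Counter

def histogram(pixels):
    """Map each pixel value to the number of times it occurs."""
    return Counter(pixels)

# A 3 x 4 image with a dark left half and a bright right half.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
flat = [v for row in image for v in row]

# Scramble the pixel positions with a fixed seed.
shuffled = flat[:]
random.Random(0).shuffle(shuffled)

same_histogram = histogram(flat) == histogram(shuffled)
```

Any threshold chosen from the histogram alone is therefore the same for the original and the scrambled picture, even though only the former contains two coherent regions.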

This  method  can  be  expected  to  work  on  images  that  happen  to  be  amenable  to  it, 
i.e.  ones  where  the  regions  are  pretty  much  homogeneous  and  separable  from  others 
by histogramming. Looking at the technique as a clustering approach, regions can be
segmented only if their 3-dimensional pixel color values can be separated by one of 9
families of parallel planes in $\mathbb{R}^3$, the planes perpendicular to the 9 coordinate axes used.
This does not even allow for separability by an arbitrary plane in $\mathbb{R}^3$, and the latter is
known to already be an overly restrictive condition for most clustering problems.

Shafer 1980

"MOOSE: Users' Manual, Implementation Guide, Evaluation"

Shafer describes a system following Ohlander's technique of image segmentation by the
use of multi-spectral histograms. The implementation is essentially automatic, and
reasonably fast (30 seconds on a PDP-10 to segment a 96 × 128 image, and 20-25 minutes
total time with all displays). See the remarks about Ohlander's work regarding the
histogramming technique.

The  author  himself  provides  some  criticism  of  the  technique.  The  main  shortcoming 
pointed out is referred to as the "majority rule" problem, which occurs when the histogram
peak separation process is dominated by large regions. In that case, if a small
region happens to be situated in a narrow valley between the large regions (i.e. the large
histograms nearly overlap), the small region will be broken in two arbitrarily. This is
a consequence of the fact that histogramming ignores geometric relationships. The solution
proposed is to first crop the picture so that a small region to be segmented from
its surround becomes a large region in the sub-picture. Of course, this amounts to an
approximate segmentation. No method is proposed to do this automatically, though the
author argues that the cropping idea is robust by showing that including some other
objects in the cropped area still allows reasonable performance. This seems to indicate
that histogramming works better for very small pictures. A seductive idea (not suggested
by the author) is to try arbitrarily subdividing the picture and simply segmenting the
smaller pictures. Unfortunately, this will create non-trivial problems in merging regions
across subpicture boundaries. In view of the many shortcomings of histogramming and
thresholding  techniques,  it  does  not  seem  worthwhile  to  pursue  improvements. 

The  author  also  points  out  the  following  problems.  Many  small  areas  at  the  boundary  of 
a region are lost since the boundary is sensitive to the threshold. He suggests the solution
of merging them after other segmentation is complete. Regions of non-constant intensity
cannot be handled, i.e. the technique fails in the presence of any shading. Strangely, he
points out that the gradient requires 2 parameters for description, but he does not know
how to express this in "one-dimensional features." Presumably, he means he wants to
histogram the gradient somehow, but using Ohlander's methods means selecting a single
parameter to histogram and threshold. In analogy, e.g. to Ohlander's use of R + G + B,
gradient magnitude seems like a reasonable candidate for one such parameter, and it is
unclear why the author neglects it.

The  eventual  goal  of  this  system  is  for  use  in  an  object  tracking  system.  One  might 
hope  that  even  if  one  couldn’t  overcome  the  problems  of  segmenting  a  single  image,  the 
segmentation  would  at  least  be  stable  from  frame  to  frame.  This  seems  to  be  a  false 
hope.  Thresholding  can  be  thought  of  as  creating  boundaries  where  some  level  plane 
intersects  the  image  parameter  value  function,  so  that  different  thresholds  correspond 
to  different  height  contours  on  a  topographic  map.  At  boundaries  with  small  gradient, 
geometry  will  change  rapidly  with  threshold  value;  and  at  peaks,  valleys,  and  saddles 
there  will  be  a  change  in  topology  as  a  function  of  threshold  value.  If  this  function  has 
lots  of  bumps,  and  if  it  is  changing  with  time,  then  there  is  a  serious  problem  of  keeping 
track  of  what  is  going  on.  The  problem  becomes  one  of  keeping  track  of  the  topological 
structure  of  the  whole  parameter  function,  particularly  its  singularity  bifurcations,  but 
this  cannot  be  done  by  simply  applying  a  single  threshold,  unless  the  regions  created  this 
way are stable over large intervals of time. What can reasonably be expected to be so
stable? An object with regions of constant parameter value (shading is tolerable in a
system which looks at hue, as long as hue is constant, an admittedly unlikely situation),
moving through light in such a way that the reflectance changes very slowly relative to the
motion, against a background having very different spectral characteristics, occurring in
an image where everything else also has different spectral characteristics than the object
and its background. This appears to be a very limited domain, though there may be
useful applications, nevertheless, e.g. in an artificial environment like an assembly line,
where  those  parameters  can  be  controlled. 
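
The change of topology with threshold can be seen even in one dimension. The sketch below uses a made-up intensity profile with two peaks (values 3 and 2) separated by a valley (value 1) and counts the regions above each threshold; the function name `regions_above` is hypothetical.

```python
def regions_above(profile, threshold):
    """Count maximal runs of consecutive samples whose value exceeds threshold."""
    count = 0
    inside = False
    for v in profile:
        if v > threshold and not inside:
            count += 1          # a new region starts here
        inside = v > threshold
    return count

profile = [0, 3, 1, 2, 0]
counts = [regions_above(profile, t) for t in (0.5, 1.5, 2.5, 3.5)]
```

The region count goes 1, 2, 1, 0 as the threshold rises: one region splits in two when the threshold passes the valley, and regions vanish when it passes each peak. A small drift in the profile over time moves these critical values, so a fixed threshold can give a different number of regions from one frame to the next.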

Optimal  linking 

Montanari 1970, Montanari 1971

"On the Optimal Detection of Curves in Noisy Pictures"

The author presents a nonserial dynamic programming approach to find optimal
8-connected paths of a fixed length on a grid, and suggests a generalization which permits
arbitrary length curves. Examples are displayed with mean square noise = mean square
signal (S/N = 1) of length 45, with good results, though the examples are not related
to real images. "Optimal" is with respect to a figure of merit (FOM); he uses one based
on $\sum \text{intensity} - \sum \text{curvature}$ (he is primarily interested in curves which arise in the
character recognition domain).
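
To make the idea concrete, here is a deliberately simplified sketch, not Montanari's algorithm: it is a serial dynamic program, the FOM is the sum of intensities only (no curvature term), and the path is assumed to visit one pixel per column with the row changing by at most 1 between columns, which makes it 8-connected. The name `best_path` is hypothetical.

```python
def best_path(grid):
    """Return (total, rows) for the path maximizing the sum of grid values,
    visiting one cell per column with row changes of at most 1."""
    rows, cols = len(grid), len(grid[0])
    # score[r][c]: best total of any admissible path ending at (r, c).
    score = [[grid[r][0]] + [None] * (cols - 1) for r in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    for c in range(1, cols):
        for r in range(rows):
            candidates = [p for p in (r - 1, r, r + 1) if 0 <= p < rows]
            prev = max(candidates, key=lambda p: score[p][c - 1])
            score[r][c] = score[prev][c - 1] + grid[r][c]
            back[r][c] = prev
    # Trace back from the best final cell.
    r = max(range(rows), key=lambda q: score[q][cols - 1])
    total = score[r][cols - 1]
    path = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        path.append(r)
    path.reverse()
    return total, path

total, path = best_path([[1, 0, 0],
                         [0, 5, 0],
                         [0, 0, 9]])
```

On this 3 × 3 example the optimum follows the diagonal, with total 15. Adding a curvature term would require the state to include the previous move as well as the current row, which is where the interaction structure of the full problem begins to grow.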

Evaluation 

Using  Montanari’s  method  as  an  edge  detector  requires  developing  an  appropriate  FOM. 
This  is  difficult,  unless  there  is  a  canonical  FOM  imposed  by  the  problem,  since  an  FOM 
is not robust in the following sense. Viz., FOM's which are monotonic functions of each
other  (and  as  regular  as  you  like)  can  give  different  global  optima.  For  edge  detection,  to 
the  extent  that  one  can  estimate  the  probability  that  local  data  were  caused  by  an  edge, 
one  can  use  an  FOM  based  on  the  relative  probability  of  the  curve,  so  there  is  promise. 
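
A two-path toy example illustrates the lack of robustness. It assumes one particular reading of the remark, namely that the local merit values summed along the path are related by a monotonic function (here the square root); the paths and values are made up for illustration.

```python
import math

def fom(path, g=lambda v: v):
    """Figure of merit: sum of g applied to each local value along the path."""
    return sum(g(v) for v in path)

path_a = [3.0, 3.0]   # uniformly moderate local evidence
path_b = [1.0, 6.0]   # one weak and one strong local value

best_linear = max((path_a, path_b), key=fom)
best_sqrt = max((path_a, path_b), key=lambda p: fom(p, math.sqrt))
```

With the raw values path_b wins (7 against 6), but after the monotonic rescaling path_a wins (about 3.46 against 3.45), so the global optimum depends on the scale chosen for the local terms and not only on their ordering.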

The requirement that the curve be an 8-connected path on a grid is troublesome, since one
would prefer smooth curves as solutions. There is no easy way to translate the optimal
path to a set of parameters representing a smooth curve, aside from an independent
fitting process. Also, it is difficult to take into account any but the most local properties
of the curve one is fitting, if for no other reason than the prohibitively large growth of
the dimension of the interaction graph for the dynamic programming problem.

Although  one  is  guaranteed  an  optimum  for  the  FOM,  it  is  not  certain  that  one  necessarily 
wants  such  an  optimum  for  image  understanding  applications,  at  least  if  the  FOM  is 
totally  decomposable  into  spatially  local  components.  The  curve  one  is  looking  for  is 
one  which  is  the  most  meaningful  in  the  context  of  the  entire  image  intensity  function 
(and  world  knowledge,  psychological  set,  etc.),  and  this  meaning  may  depend  on  data 
away  from  the  curve,  which  would  lead  to  an  intractable  interaction  graph  for  a  naive 
The condition that the 2 derivatives be nonzero when $s$ is $C^r$ is there for 2 reasons: First,
so that the definition reduces to the $C^0$ case, which is intuitively the meaning of "zero
crossing." And secondly, to avoid degenerate cases, e.g. when the locus of zeroes is a
submanifold perpendicular to the $x$-axis in the $x, y, \theta$ space, i.e. when the zero locus is
tangent to the $\theta$ direction. Note that in this case, a zero can still be a regular point of
$s$. Conversely, even if $s$ is $C^r$, the $C^0$ condition is weaker, since e.g. it doesn't exclude
tangent crossings.

Theorem. $s : \mathbb{R}^2 \times S^1 \to \mathbb{R}$ cannot have an isolated zero crossing in either of the above
senses. (By isolated we mean there are no other zeroes in the $x, \theta$ manifold, for fixed $y$.)
That is, edges cannot be localized simultaneously in $x$ and $\theta$ by the zero crossings of a
single ($\theta$-parametrized) convolution operator.

Proof. 

Case 1: $s$ of class $C^r$, $r \ge 1$

Since $(x, y, \theta)$ is a regular point of $s$, the implicit function theorem applies and in some
neighborhood of $(x, y, \theta)$, $s^{-1}(0)$ is a $C^r$ submanifold of dimension 2. The conditions on
the partials guarantee that the surface is not normal to any of the $x$, $y$, or $\theta$ axes, so
that for fixed $y$, there is a curve of $(x, \theta)$ values for which $s(x, y, \theta) = 0$, so that the zero
cannot be localized in $x$ and $\theta$ simultaneously. A more direct way to see this is to observe
that what we are seeking is a function $s$ whose zero crossings are the locus of an edge.
Regarding the edge as a function $\gamma : \mathbb{R} \to \mathbb{R}^2$, it's obvious that adding orientation leads
to a function $\lambda : \mathbb{R} \to \mathbb{R}^2 \times S^1$ defined by $\lambda(t) = (\gamma(t), \theta(t))$, where $\theta(t)$ is the orientation
of the edge at $\gamma(t)$. Since the image of $\lambda$ is 1-dimensional, we cannot hope for it to be
the inverse image of a regular value of a map to the reals, since by the implicit function
theorem, that must be a 2-dimensional object. But by the same token, if we have instead
$s : \mathbb{R}^3 \to \mathbb{R}^2$, then one can try to find edges by finding $s^{-1}(0)$. QED Case 1.
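
The dimension count can be illustrated with a toy example. The functions below are chosen purely for illustration and are not operators from the text; $\theta$ is treated locally as a real coordinate.

```latex
% One function: the zero set is a surface, so zeros are never isolated.
s(x, y, \theta) = x - \theta, \qquad
D_1 s = 1 \neq 0, \quad D_3 s = -1 \neq 0, \qquad
s^{-1}(0) \cap \{ y = y_0 \} = \{ (x, y_0, \theta) : x = \theta \}.

% Two functions: the common zero set is a curve, isolated for fixed y.
s = (s_1, s_2), \quad s_1(x, y, \theta) = x, \quad s_2(x, y, \theta) = \theta, \qquad
s^{-1}(0) = \{ (0, y, 0) : y \in \mathbb{R} \}.
```

In the first case every zero satisfies the $C^r$ definition of a zero crossing, yet for fixed $y$ the zeros fill out the line $x = \theta$. In the second case, for each fixed $y$ the only common zero is $(x, \theta) = (0, 0)$, and the zero set is a 1-dimensional curve, as an edge locus with orientation should be.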
The Limitations of Zero-crossings


Definition  of  zero  crossings 

It's sometimes handy to have a notation for the function you get by holding the arguments of some other
function fixed. We will use the notation $f(\cdot, y)$ for the function that results from fixing the 2nd argument
of the function $f$ to be the value $y$. The dot represents an argument position to be filled. The purpose
is to use a notation like $f(x, y)$ while avoiding the confusion of whether it is $x$, $y$, or both which are
the variable, since each of these cases actually corresponds to a different function object. More precisely,
suppose that

$$f : X \times Y \to Z, \qquad (x, y) \mapsto z$$

so that $f(x, y) = z$. Now define

$$f(\cdot, y) : X \to Z$$

i.e., $f(\cdot, y)(x) = f(x, y) = z$.

If $s$ is $C^0$ (continuous), then we will say it has a zero crossing at $(x, y, \theta)$ if the functions
$s(\cdot, y, \theta)$ and $s(x, y, \cdot)$ both have 1-dimensional zero crossings at $x$ and $\theta$, respectively.
Colloquially, this means that the $x$ and $\theta$ functions have zero crossings. We don't require
a zero crossing in the $y$ direction, because it may be the direction of the edge. We will
say that $f : \mathbb{R} \to \mathbb{R}$ has a 1-dimensional zero crossing at $x$ if $f(x) = 0$, $x$ is the only
zero in some neighborhood, and $f$ has opposite signs on opposite sides of $x$ in such a
neighborhood.

If $s$ is $C^r$, $r \ge 1$, then we will say that $s$ has a zero crossing at $(x, y, \theta)$ if $s(x, y, \theta) = 0$
and $D_1 s(x, y, \theta) \ne 0 \ne D_3 s(x, y, \theta)$, where $D_i$ indicates the derivative with respect to
the $i$-th coordinate. Thus $(x, y, \theta)$ is a regular point of $s$, which means that not all of its
partials are 0 at that point. This implies the $C^0$ definition of zero crossing.

Remarks  on  the  definition  of  zero  crossings 

The picture we are keeping in mind has the edge oriented along the $y$-axis. The definitions
seem to single out a particular set of coordinates asymmetrically, to keep with this picture.
However, the definitions really only require that the $x$ and $\theta$ axes not be oriented along
the edge; equivalently we could have required that some set of coordinate axes with these
properties  exist. 


2) For each $y \in U$, each partial $D_j f(x, y)$ (taken with respect to the $j$-th $y$-variable) is in
$L^1(\mu)$.

3) There exists a function $f_1 \in L^1(\mu)$ such that for all $y \in U$,

$$|D_j f(x, y)| \le |f_1(x)|.$$

Let

$$\Phi(y) = \int_X f(x, y)\, d\mu(x).$$

Then $D_j \Phi$ exists and we have

$$D_j \Phi(y) = \int_X D_j f(x, y)\, d\mu(x).$$


The lemma permits us to conclude the following

Theorem [Lang 1969]. Let $f \in L^1$ and $\varphi \in C^r$, $r \ge 1$, with compact support. Then
$f * \varphi \in C^r$ and $D^p(f * \varphi) = f * D^p \varphi$ for $p \le r$.

Notice that this means that no matter how badly behaved $f$ may be, $f * \varphi$ is as
differentiable as $\varphi$. In particular, convolution with a $C^\infty$ function results in a $C^\infty$ function.
In our situation, if either the picture or the convolution kernel is differentiable with
respect to the parameters (we may interchange the two, allowing the symmetries to act on
the picture if it suits us), then our function $s$, the convolution, is likewise differentiable.
If we have a differentiable kernel, then we can take $U$ to be an open set in the parameter
space of the kernel, and $X$ to be the picture plane. If instead, we have a differentiable
picture, we reverse the roles of $U$ and $X$. On the other hand it may happen that both the
kernel and picture contain discontinuities, e.g. if they have steps. In that case, integration
by parts yields the fact that the convolution, i.e. $s$, is continuous.
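
The identity $D(f * \varphi) = f * D\varphi$ can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions: it uses NumPy, a step function truncated to the grid as the badly behaved $f$, and a Gaussian standing in for a compactly supported smooth kernel (its tails are negligible on this grid). The comparison is restricted to points away from the ends of the grid.

```python
import numpy as np

h = 0.01
x = np.arange(-5.0, 5.0 + h / 2, h)        # grid symmetric about 0
f = (x > 0).astype(float)                   # step: a discontinuous "picture"

sigma = 0.5
phi = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
dphi = -x / sigma**2 * phi                  # analytic derivative of phi

# Discrete approximations to the convolution integrals.
conv = np.convolve(f, phi, mode="same") * h
lhs = np.gradient(conv, h)                  # D(f * phi), by finite differences
rhs = np.convolve(f, dphi, mode="same") * h # f * D(phi)

interior = np.abs(x) < 2.0                  # stay clear of the grid ends
max_err = np.max(np.abs(lhs - rhs)[interior])
```

Although $f$ has a jump, the smoothed signal is differentiable, and its derivative agrees with the convolution of $f$ against the derivative of the kernel up to discretization error.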
Using all the notations, we can include rotations by allowing $G$ to be the rigid motion
(Euclidean) group of $\mathbb{R}^2$, and considering the functions defined by

$$S(K, F, x, \theta) = K_\theta * F(x) = \langle T_{(x,\theta)}(K), F \rangle = \langle T_{\tau_x \circ \iota \circ \rho_\theta}(K), F \rangle = \int K \circ \rho_\theta^{-1} \circ \iota \circ \tau_x^{-1} \cdot F \, dA$$

We are interested in the function obtained from $K_\theta * F(x)$ by fixing $K, F$: this is a function
$s : \mathbb{R}^2 \times S^1 \to \mathbb{R}$, i.e. a function of $x, \theta$. I.e., we define $s$ by $s(x, \theta) = S(K, F, x, \theta)$. It is
the zero crossings of $s$ which we are seeking. Let's underscore the role of the symmetry
group $G$ in the definition of $s$. The construction we used to define $s$ actually defines a
map $s : G \to \mathbb{R}$. In fact, for any family of $K$'s defined by some map $M \to \mathcal{F}(\mathbb{R}^2)$, where
$M$ is the indexing set for the family, we can define $s : M \to \mathbb{R}$. This way, one can easily
add parameters, e.g. to allow different size operators, and this type of analysis is still
applicable.

We want to show that for $s : M^3 \to \mathbb{R}$ (where $M^3$ is a 3-dimensional manifold) the
"zero crossings" cannot be an edge locus. To do this, we will have to be more precise
about what we mean by "zero crossing," and we will consider separately the cases where
$s$ is differentiable, and only continuous. The 2 cases can be analyzed independently; the
continuous case subsumes the differentiable case, but since the differentiable case provides
better insight, we treat it first.

To begin with, we make some observations about when we can conclude that $s$ is continuous
or continuously differentiable. Here is Lemma 2 of Ch. XIV §1 (p. 375) of [Lang
1969].

We take $L^1$ to be the space of all Lebesgue integrable functions, with the norm given by $\|f\| = \int |f| \, d\mu$,
equivalenced by the functions of norm 0. Cf. the definition of $L^1$ given in the later section on the
nonlinear reflection operator.

Lemma. Let $X$ be a measured space with positive measure $\mu$. Let $U$ be an open subset
of $\mathbb{R}^n$. Let $f$ be a function on $X \times U$. Assume:

1) For each $y \in U$ the function $x \mapsto f(x, y)$ is in $L^1(\mu)$.
$T_g(K)$ is what you get when you "move $K$ by $g$;" the right hand side of its definition
shows how to calculate the new $K$. For a translation $\tau_x$, $T_{\tau_x}(K)(p) = K(p - x)$, which
is often baffling for beginning students.

Define the inversion operator, $\iota$, by

$$\iota : \mathbb{R}^2 \to \mathbb{R}^2, \qquad x \mapsto -x$$

Note that $\iota^{-1} = \iota$, and that in $\mathbb{R}^2$, inversion is the same as rotation by 180°.

Using the notation for inversion and translation, the convolution formula can be rewritten

$$K * F(x) = \int K \circ \iota \circ \tau_x^{-1} \cdot F \, dA$$

where $x$ is now a generic point of $\mathbb{R}^2$ and $dA$ is the area measure. Note $\iota \circ \tau_x^{-1} = \tau_x \circ \iota =
(\tau_x \circ \iota)^{-1}$. So, using the $T$ notation,

$$K * F(x) = \int T_{\tau_x \circ \iota}(K) \cdot F \, dA$$

or, abusing the notation somewhat,

$$K * F(x) = \int T_x(K) \cdot F \, dA$$

We can make the notation more compact by using the $L^2$ inner product $\langle \cdot, \cdot \rangle$, defined by
$\langle g, h \rangle = \int g h \, dA$:

$$K * F(x) = \langle T_x(K), F \rangle$$

We can define a rotation operator, $\rho$, by

$$\rho_\theta : \mathbb{R}^2 \to \mathbb{R}^2$$
re* 
or, in vector notation

$$K * F(x) = \int_{\mathbb{R}^2} K(x - \xi)\, F(\xi)\, dA$$

We  want  to  use  a  more  abstract  notation  for  this,  so  that  we  can  generalize  it  slightly  in 
a  transparent  way. 


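Let $g : \mathbb{R}^2 \to \mathbb{R}^2$ be an invertible map, e.g. a rotation or translation of the plane, and let
$K : \mathbb{R}^2 \to \mathbb{R}$. To describe "doing" $g$ to $K$, define the map $T_g : \mathcal{F}(\mathbb{R}^2) \to \mathcal{F}(\mathbb{R}^2)$, where
$\mathcal{F}(\mathbb{R}^2)$ is a space of functions each taking $\mathbb{R}^2 \to \mathbb{R}$, by

$$T_g(K) = K \circ g^{-1}$$

Observe that $T_{g \circ h}(K) = K \circ (g \circ h)^{-1} = K \circ h^{-1} \circ g^{-1}$, so the argument transformations
"go in reverse order" from the space transformations. Notice also the interesting fact
that $T_g$ is a linear map, even if $K$ and $g$ are not. Proof: $T_g(aK + L) = (aK + L) \circ g^{-1} = a\,(K \circ g^{-1}) + L \circ g^{-1} = a\,T_g(K) + T_g(L)$.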
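
These properties are easy to verify with ordinary function composition. The sketch below is illustrative only: maps of the plane are represented by Python functions on coordinate pairs, each transformation is supplied through its inverse, and the kernels `K` and `L` and all helper names are made up for the example.

```python
import math

def T(g_inv):
    """Return the operator T_g, given the inverse map g^{-1} as a function."""
    return lambda K: (lambda p: K(g_inv(p)))

def translation_inv(x):
    """Inverse of the translation p -> p + x."""
    return lambda p: (p[0] - x[0], p[1] - x[1])

def rotation_inv(theta):
    """Inverse of rotation by theta, i.e. rotation by -theta."""
    c, s = math.cos(theta), math.sin(theta)
    return lambda p: (c * p[0] + s * p[1], -s * p[0] + c * p[1])

def compose(f, g):
    """Return f o g (apply g first, then f)."""
    return lambda p: f(g(p))

K = lambda p: p[0] ** 2 + 3.0 * p[1]      # an arbitrary nonlinear kernel
L = lambda p: math.sin(p[0]) * p[1]       # a second kernel

g_inv = translation_inv((2.0, -1.0))      # g is translation by (2, -1)
h_inv = rotation_inv(0.3)                 # h is rotation by 0.3 radians
p = (0.7, 1.9)

# T_{g o h}(K) = K o h^{-1} o g^{-1} agrees with T_g(T_h(K)).
lhs = T(compose(h_inv, g_inv))(K)(p)
rhs = T(g_inv)(T(h_inv)(K))(p)

# Moving K by a translation evaluates K at p - x.
moved = T(g_inv)(K)(p)
direct = K((p[0] - 2.0, p[1] + 1.0))

# Linearity in K, even though K, L and the maps are not linear functions.
a = 2.5
combo = lambda q: a * K(q) + L(q)
lin_lhs = T(g_inv)(combo)(p)
lin_rhs = a * T(g_inv)(K)(p) + T(g_inv)(L)(p)
```

The minus sign in `translation_inv` is the point that tends to surprise: to move the kernel by $x$, one evaluates the original kernel at $p - x$.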

In particular, let $G$ be the translation group of $\mathbb{R}^2$, where $\tau_x \in G$ is defined by

$$\tau_x : \mathbb{R}^2 \to \mathbb{R}^2, \qquad p \mapsto p + x$$
Edge localization in both $\theta$ and $x$

Introduction

One  way  to  localize  edges  is  by  finding  zero  crossings  of  a  convolution  operator.  This 
method yields a precise value for, say, the $x$-coordinate, but to determine the orientation
of the edge requires further processing, e.g. using a number of oriented operators (which
may disagree as to the $x$-position) or by observing the locus of zero crossings. An
integrated  method  of  extracting  the  position  and  orientation  would  be  preferable. 

[Binford 1981] proposes localizing edges in position and orientation simultaneously by
convolving a lateral inhibition signal with a directional operator and viewing the results
as a set of "stacked planes," one for each orientation of the operator. The estimate for the
edge would be based on finding maxima of the gradient of the lateral inhibition signal with
respect to position and angle, by seeking zero crossings of the partial derivatives. Since
all of the operations prior to finding zeroes would be implemented as convolutions, it is
the zero crossings of the resulting convolutions which are sought. As ultimately stated in
[Binford 1981], 2 convolutions, corresponding to 2 partial derivatives, must be considered.
However, it is natural to ask first whether this can be accomplished by finding the zero
crossings of a single convolution. In the following, we show that this is impossible, using
the inverse function theorem in what is essentially a dimensionality argument. This is
why it is necessary for [Binford 1981] to require the use of 2 convolutions.

Some  Mathematics  of  Parametric  Convolutions 

Let $F : \mathbb{R}^2 \to \mathbb{R}$ be a picture function, and $K : \mathbb{R}^2 \to \mathbb{R}$ a convolution kernel, that we
also refer to as a convolution operator. The normal definition of the convolution $K * F$
is

$$K * F(x, y) = \iint K(x - \xi, y - \eta)\, F(\xi, \eta)\, d\xi\, d\eta$$
this at length in the survey chapter.) We were led to a more general operator, based
on  symmetry  considerations,  which  turns  out  to  be  intrinsically  nonlinear.  We  describe 
this  novel  operator,  including  some  of  the  theory  around  it,  after  discussing  an  idea 
of  Binford’s  for  an  operator  using  a  ratio  of  linear  terms,  also  based  on  symmetry 
considerations.  The  nonlinear  operator  avoids  some  of  the  shortcomings  of  linear  filters. 

Finally,  we  propose  a  variational  technique  for  combining  local  edge  data  into  optimal 
global edges. The key new observation is that the globalization problem can be put
into a form nearly identical to the Lagrangian formulation of mechanics. This allows the
global  variational  problem  to  be  reduced  to  completely  local  conditions. 
Introduction

In  this  chapter  we  present  original  attacks  on  some  of  the  problems  we  discussed  in 
Chapter  2,  our  edge  detection  survey.  Major  problems  in  edge  finding  are  detection, 
localization,  and  globalization;  and  the  most  frequent  tool  is  convolution.  Detection 
consists  in  determining  whether  or  not  an  edge  is  present  in  a  given  neighborhood. 
Localization  is  the  extraction  of  the  precise  position  and  orientation  of  the  edge.  By 
globalization,  we  mean  finding  edge  contours  of  large  extent,  in  contrast  to  local  edge 
finding,  which  is  concerned  only  with  small  neighborhoods.  Convolution  is  commonly 
used for deriving local information, frequently in a form similar to matched filtering.
Greater detail can be found in Chapter 2, the survey of edge detection.

We begin by establishing some background mathematics for studying families of
convolution-like operators which are defined by some group, such as rotation. We then
use this formulation to prove an original theorem showing that a single family of such
operators, parametrized by rotation, is not adequate for simultaneous position and orientation
localization by a zero-crossing method. The consequence is that more involved
methods, with multiple families, are required for this type of attack.

We  present  a  novel  localization  operator  which  uses  a  least  squares  fit  to  find  a  best  local 
zero-crossing  line. 

An  obstacle  in  detection  is  that  although  matched  filters  give  high  response  at  edges,  they 
also  often  give  above  threshold  response  at  uninteresting  features.  (We  have  discussed 

Some mathematical background which is assumed in this chapter, such as functional notation and some
results from differential topology, is explained in more detail in the fine print of the chapter Geometric
Methods in Vision.

hence only a finite number of final states. One should give some thought as to whether
that  is  an  acceptable  situation.  It  could  be  remedied  by  altering  the  relaxation  coefficients 
based on the current state.

be performed by a system of ordinary differential equations. Incidentally, economies can
be  gained  by  transforming  the  state  space  so  as  to  diagonalize,  upper  triangularize,  or 
Jordan  normal  form-alize  the  relaxation  map  if  it  is  linear  or  approximately  so.  Even 
if  the  relaxation  is  not  embeddable  in  a  continuous  dynamical  system,  nearly  all  the 
machinery  of  dynamical  system  theory  is  available.  For  example,  if  the  fixed  points  are 
known,  theorems  are  available  telling  us  under  what  conditions  there  is  convergence  near 
the  fixed  point,  whether  the  system  is  stable  (i.e.  robust  with  respect  to  the  choice  of 
relaxation  parameters),  etc. 
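
As a minimal sketch of the economy gained by diagonalizing a linear relaxation map, assume a hypothetical 2-label system whose update matrix is A = [[1-a, b], [a, 1-b]]. The matrix, the parameters `a` and `b`, and the function names are illustrative assumptions, not taken from the papers reviewed. In the eigenbasis, k iterations reduce to a single power of the one non-unit eigenvalue, which also gives the convergence rate.

```python
def relax_step(p, a, b):
    """One step of a hypothetical 2-label linear relaxation p -> A p,
    with A = [[1-a, b], [a, 1-b]] (columns sum to 1, so total weight is kept)."""
    return ((1 - a) * p[0] + b * p[1], a * p[0] + (1 - b) * p[1])

def relax_iterated(p0, a, b, k):
    """Apply the relaxation map k times, one step at a time."""
    p = p0
    for _ in range(k):
        p = relax_step(p, a, b)
    return p

def relax_closed_form(p0, a, b, k):
    """Same result computed in the eigenbasis of A.
    Eigenvalue 1 has eigenvector (b, a); eigenvalue 1-a-b has eigenvector (1, -1)."""
    total = p0[0] + p0[1]
    fixed = (total * b / (a + b), total * a / (a + b))  # component along eigenvalue 1
    lam = 1 - a - b                                     # governs the convergence rate
    c = p0[0] - fixed[0]                                # coefficient of (1, -1)
    return (fixed[0] + lam**k * c, fixed[1] - lam**k * c)
```

With |1-a-b| < 1 the iteration converges to the same fixed point from any starting weights of equal total, which illustrates why a purely linear scheme of this kind carries no information about the input data.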

Evaluation 

The experimental results presented are unfortunately not very impressive. But that
may well be because the continuous spectrum of labellings generalization has not been
made, and because the relaxation (compatibility) coefficients are chosen ad hoc without
any rigorous consideration of robustness. So the poor experimental results should not
be regarded as an indictment of the idea. Rather, it should be developed with greater
sophistication. For example, one should consider the effects of noise in a quantitative
way. One should try to discover whether there are any global quantities being optimised
in the solution. One might consider generalizations to infinite sets of objects, e.g. curves.
Thus  the  local  label  would  be  a  probability  density  function  for,  say,  edge  orientations 
and  strengths.  This  leads  to  an  infinite  dimensional  state  space,  and  although  there 
is  a  respectable  theory  of  dynamical  systems  in  such  spaces,  one  must  confront  the 
computational  difficulties.  However,  since  the  function  factors  composing  the  space  are 
on  compact  domains,  there  is  a  natural  decomposition  in  terms  of  Fourier  series  (of  the 
probability  densities  in  the  orientation  strength  domain),  and  it  is  equally  natural  to 
truncate these series, so one again obtains a finite dimensional characterization of the
state  space.  One  would  then  want  to  study  the  relationship  between  such  a  process  and, 
say,  variational  methods.  One  can  expect  that  for  reasonably  regular  (finite  dimensional) 
systems, there will be a finite number of fixed points outside of a small neighborhood,

The authors generalize a method first developed in computer vision by [Waltz 1972] for
propagating  constraints  in  a  graph.  Waltz  called  it  “filtering”  and  used  a  sequential 
process;  the  present  authors  call  it  “relaxation”  (perhaps  due  to  its  similarity  to  a  method 
used  for  solving  partial  differential  equations,  though  it  is  not  derived  from  it)  and  do 
it  in  an  essentially  parallel  way.  One  starts  with  some  finite  set  of  objects,  some  set  of 
interpretations  for  each  object,  and  a  graph  where  the  nodes  are  the  objects  and  arcs 
represent  mutual  constraints  between  interpretations  (“labellings”)  of  the  objects.  The 
authors treat 3 types of labelling sets: discrete (finite set of labels), fuzzy (finite set of
labels with weights between 0 and ∞), and probabilistic (finite set of labels with weights
between 0 and 1). A generalization to a continuous set of labels is not hard to imagine,
and would be useful in the applications, for example to represent the orientation of an
edge. For the probabilistic case, they readily show that the relaxation process has a fixed
point, a necessary condition for convergence. They go on to show that for a class of linear
operators with eigenvalues of norm no more than unity, convergence to the unique fixed
point is guaranteed. Unfortunately, it's not an interesting case, because the fixed point
is independent of initial conditions, i.e. input data. They also present a more interesting
nonlinear operator, but are unable to prove that it converges. One can probably invoke
one of many variations of the contraction mapping theorem to show convergence for their
linear case as well as nonlinear mappings which are contractions in the appropriate sense,
thereby expending less effort and achieving greater generality. The important point,
however,  is  that  a  wide,  useful  class  of  such  relaxation  operators  converges.  One  can  even 
say  something  about  the  speed  of  convergence,  based,  for  example,  on  the  eigenvalues 
of  the  relaxation  iteration  operator.  The  idea  is  closely  related  to  dynamical  systems, 
which  has  interesting  implications  for  neurophysiology  and  hardware  design.  If  one  views 
the  state  space  as  a  free  vector  space  on  the  labels  over  the  field  of  weights  (which  we 
take  to  be  R),  then  the  relaxation  is  a  map  of  that  space  to  itself.  If  that  map  is  a 
diffeomorphism, it may be embeddable as a time-one map of a flow, i.e. it may be the
discrete time snapshot of a continuous dynamical system. In that case, the process can

are possible edge elements (pixel adjacencies) and the directed arcs are the allowed edgel
successors.

The computation cost varies with S/N, since that is what determines how much searching
must  be  done.  This  is  presented  as  a  positive  feature. 

He  uses  a  pairwise  FOM  of  edge  strength  (nearest  neighbor  difference),  but  suggests  that 
a  larger  local  operator  would  improve  noise  performance. 

Evaluation 

Summary 

The  technique  is  susceptible  to  the  standard  problems  associated  with  FOM. 

The  local  operator  is  still  very  important. 

The same problems as in [Montanari 1970, Montanari 1971] are still present.

The  results  look  reasonable,  but  no  results  are  presented  for  real  images. 

Analysis  is  required  to  decide  whether  the  process  can  be  made  parallel — 
as  it  stands  it  is  intrinsically  sequential. 

The technique presented by Martelli in this paper is not usable in its current form. With
an appropriate local operator, reasonable FOM, and the right discrete variables (i.e. edgel
parametrization), it might produce reasonable results. But that says only that global
edge finding can be approached as a search problem. Furthermore, it seems likely that
parallel search methods would be cheaper (as well as faster) than sequential, in analogy to
simultaneous backward and forward searching in classical search problems. An intriguing
idea is to use geometric information (i.e. relative direction) of other growing edges to
compute the heuristic function (i.e. expansion ordering) for the search problem: edgels
would be tried first that led toward something they might mate with.

Rosenfeld, Hummel, Zucker 1975, Zucker, Hummel, Rosenfeld 1977
“Scene Labelling by Relaxation Operations”

implementation (i.e., one without special processing for these other parameters).

The  dynamic  programming  approach  is  computationally  very  efficient;  generalisations 
and  adaptations  of  Montanari’s  method  are  probably  worth  pursuing,  although  it  is  not 
a  trivial  matter  to  do  so. 

Martelli 1972, Martelli 1973

“Edge Detection using Heuristic Search Methods”

Following  Montanari,  Martelli  suggests  that  heuristics  should  be  embedded  in  a  figure  of 
merit  (FOM)  rather  than  in  code.  But  it  is  questionable  whether  an  FOM  is  enough  in 
the  way  of  heuristics — especially  if  it  is  not  based  on  an  analysis  of  real  images. 

He shows that any dynamic programming problem can be posed as a minimal path in a
graph  problem,  arguing  that  this  is  good  because  the  use  of  heuristics  to  speed  up  search 
in  a  graph  is  well-studied.  However,  the  equivalence  result  is  not  very  deep  (each  variable 
expands  to  a  set  of  nodes,  one  for  each  value).  The  advantage  of  dynamic  programming  is 
that  it  is  far  cheaper  than  graph  search,  and  a  better  question  is  usually  whether  a  graph 
search  problem  can  be  cast  as  a  dynamic  programming  one.  One  can  apply  heuristics  in 
the  dynamic  programming  paradigm  as  well. 

The variables $x_i$ of the dynamic programming problem $\mathrm{FOM} = f(x_1, \ldots, x_n)$ are
the edge elements, which are discrete valued and thus hard to generalise to continuous $\theta$ edgels.

The figure of merit is defined in the form $\mathrm{FOM} = \sum_i c_i(x_i, \ldots, x_{i+k})$; see the criticism
of Montanari that monotonically related $c_i$'s don't lead to the same optimum. In this
connection, no discussion of robustness with respect to FOM's is presented.
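
To make the cost comparison concrete, here is a minimal sketch of dynamic programming on a figure of merit of the form above with k = 1, checked against exhaustive search. The local term `c`, the value set, and all function names are hypothetical; a real FOM would be built from edge strengths and curvature.

```python
from itertools import product

def best_path_dp(n, values, c):
    """Maximize sum_{i=0}^{n-2} c(i, x_i, x_{i+1}) over x in values^n
    by dynamic programming (the k = 1 case of the FOM form)."""
    # best[v] = (score, path) of the best partial assignment ending in value v
    best = {v: (0.0, [v]) for v in values}
    for i in range(n - 1):
        new_best = {}
        for v in values:
            score, path = max(
                ((best[u][0] + c(i, u, v), best[u][1]) for u in values),
                key=lambda t: t[0],
            )
            new_best[v] = (score, path + [v])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

def best_path_brute(n, values, c):
    """Exhaustive search over all len(values)**n assignments, for comparison."""
    def fom(x):
        return sum(c(i, x[i], x[i + 1]) for i in range(n - 1))
    x = max(product(values, repeat=n), key=fom)
    return fom(x), list(x)

# Hypothetical local term: reward large values, penalize jumps between neighbors.
def c(i, u, v):
    return 0.5 * (u + v) - abs(u - v) ** 2
```

The dynamic programming pass evaluates on the order of n·m² local terms for m values per variable, whereas the exhaustive search evaluates mⁿ assignments.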

He derives a search graph for the dynamic programming problem, then uses the A*
algorithm to search the graph. The search graph is just a directed graph where the nodes

Case 2: $a$ of class $C^0$

We restrict attention to the function defined on the $(x, \theta)$ manifold, and show that every
zero  crossing  is  an  accumulation  point  of  zero  crossings. 


Look at Fig. (proof1). What it shows, schematically, is an edge operator in the vicinity
of an edge, and the result of applying some motions to it. The positions are labelled on
an arbitrary scale. The (0,0) position is where the zero crossing is. If one assumes there
are no other zeroes in some neighborhood, the indicated operations show that there are
2  ways  to  get  to  the  same  position  of  the  operator  with  opposite  signs  for  the  result,  a 
contradiction. 

The  notched  circle  represents  the  position  and  orientation  of  the  operator,  while  the 
vertical  line  is  a  reference  value  for  x  and  is  meant  to  suggest  the  edge  locus.  Starting  at 
the  upper  left  picture,  we  can  get  to  the  upper  right  picture  by  translating  the  operator 
support to the left slightly, which is the meaning of the small arrow over the long arrow.
Since we were at a zero-crossing in $x$, the value we get by applying the operator at this
new position must be nonzero; let us call its polarity +, which we indicate near the
coordinates. Instead of translating, we can rotate, and this is schematized in going
from the upper left picture to the lower left. Again, since we start at a zero-crossing in $\theta$,
a slight rotation puts us into a nonzero value. Call its polarity −. We can assure that it is
not the + of the upper right corner because we have a choice of 2 directions of rotation;
by the definition of zero-crossing, one of these will give us + and the other will give −, so
we choose the one which gives −. Now we have a contradiction to the assumption that
the zero-crossing was isolated, when we observe what happens as we try to get to the
lower right corner position. We have assumed that the zero-crossing at the upper left was
isolated. Going from the lower left configuration to that of the lower right by a slight
translation in the absence of a zero-crossing requires that the polarity of the lower right
position be −. On the other hand, doing the same thing by moving from the upper right
by a slight rotation to the lower right, with the assumption of no zero-crossing, yields a
polarity of + for the lower right. The value of the operator applied in the position of the
lower right corner can only have one sign, so in fact there must be another zero-crossing
somewhere, contrary to assumption. Note that this is still true no matter how tiny the
rotations and translations of the operator.

A more abstract picture of this is Fig. (proof2), which shows a region of the $(x, \theta)$ manifold
near the zero crossing. In particular, we can assume without loss of generality that moving
up (i.e. rotating) causes $a$ to become + (else flip the picture top for bottom), while moving
right (translating) causes $a$ to become − (else flip right for left). The restriction of $a$ to
the line joining the 2 end points of these motions must have a zero, by the intermediate
value theorem (see, e.g. [Rudin 1964]). QED

Nonlinear Local Edge Detection

An  ideal  step  function  is  the  sum  of  an  even  part — a  constant  function — and  an  odd 
part — a  symmetrical  step  (top  of  Fig.  (latinh)).  For  edge  detection,  it  is  the  odd  part 
which  is  of  interest.  [Canny  1983],  for  example,  requires  that  his  optimum  convolution 
kernel be an odd function, since the even part cannot contribute to detection of a step.
The bottom of Fig. (latinh) shows the (1-dimensional) result of applying lateral inhibition
to  a  step  edge,  i.e.  of  convolving  a  step  edge  with  a  zero-sum  difference  of  boxes  (middle 
of  Fig.  (latinh)). 


Fig. (latinh)

Since lateral inhibition is an even operator, the result is again an odd function. Also, it is
the  central  zero  crossing  which  marks  the  edge,  while  the  lateral  inhibition  has  introduced 
spurious  peripheral  zero  crossings. 
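
The behavior just described can be checked on a small 1-dimensional example. This is only a sketch: the kernel weights, the signal length, and the names `correlate_valid`, `kernel`, and `step` are assumptions chosen for illustration, not the operator used in the experiments.

```python
def correlate_valid(signal, kernel):
    """Slide the kernel over the signal, keeping only fully overlapping positions.
    For a symmetric (even) kernel this equals convolution."""
    n, m = len(signal), len(kernel)
    return [sum(kernel[k] * signal[j + k] for k in range(m)) for j in range(n - m + 1)]

# Zero-sum, even difference of boxes: a 3-sample center against a 9-sample surround.
kernel = [-1, -1, -1, 2, 2, 2, -1, -1, -1]

# Step edge at index 20: constant even part (5) plus a symmetrical odd step (-1, 0, +1).
half = 20
step = [5 + (i > half) - (i < half) for i in range(2 * half + 1)]

out = correlate_valid(step, kernel)
center = len(out) // 2   # output sample aligned with the edge
```

In this noiseless case the constant part is annihilated, the output is odd about `center` with a sign change there, and it returns to exactly zero once the kernel no longer straddles the edge.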

A  common  approach  to  detecting  signals  like  this  is  through  matched  filtering,  template 
matching,  or  surface  fitting,  all  of  which  are  essentially  equivalent  linear  processes. 
However, these tend to respond to undesired components while remaining specialized to a
particular functional shape. [Binford 1981] proposed using an even-odd characterization
for dealing with this problem. We have used a somewhat different characterization of
even-odd, which led to the edge detector described below, an intrinsically and nontrivially
nonlinear operator. Nonlinearity has the advantage that space and intensity are not
equivalent. I.e., a linear operator has no way to tell the difference between a high but
localized noise spike and a large moderately positive area. While linearity always has this
problem,  nonlinearity  can  avoid  it.  Also,  the  even-odd  characterization  is  more  general 
than  a  matched  filter  kernel,  and  thus  detects  a  more  general  class  of  functions,  so  one  is 
not  limited  to  the  ideal  step.  The  reflection  operator  described  below  will  find  edges  on 
a  checkerboard  pattern  smaller  than  its  support,  something  a  matched  filter  cannot  do 
(unless it's a matched filter for that size of checkerboard, of course).


Fig.  (checkerboard) 


The reflection operator can be thought of as adding up a measure of edgeness along each
line  perpendicular  to  the  prospective  edge,  regardless  of  polarity.  This  could  be  done 
with  an  operator  linear  on  each  such  line,  if  a  nonlinear  operation  such  as  absolute  value 
or  squaring  were  done  before  summing  the  line  values.  This  would  result  in  essentially 
the  same  operator,  though,  once  all  the  nonlinear  terms  were  herded  up. 

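Let $f$ be the laterally inhibited picture function. Then the even and odd parts are defined
by

$$f_{\mathrm{even}}(x) = \tfrac{1}{2}\,[f(x) + f(-x)]$$

$$f_{\mathrm{odd}}(x) = \tfrac{1}{2}\,[f(x) - f(-x)]$$

To make the notation more compact, we can define $\bar{f}$ by $\bar{f}(x) = f(-x)$. Then

$$f_{\mathrm{even}} = \tfrac{1}{2}\,(f + \bar{f})$$

$$f_{\mathrm{odd}} = \tfrac{1}{2}\,(f - \bar{f})$$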

The  even-odd  operator  of  [Binford  1981] 

[Binford 1981] describes using the even and odd parts as follows. Let

$$R = \int_0^w f(x)\,dx, \qquad L = \int_0^w f(-x)\,dx$$

where  we  understand  that  we  can  consider  this  a  summation  by  using  a  discrete  measure. 
Then, in terms of our previous definitions,

$$R + L = \int_0^w \bigl(f(x) + f(-x)\bigr)\,dx = 2\int_0^w f_{\mathrm{even}}(x)\,dx$$

$$R - L = \int_0^w \bigl(f(x) - f(-x)\bigr)\,dx = 2\int_0^w f_{\mathrm{odd}}(x)\,dx$$


The  even-odd  measurement  is  then  given  by 


$$\frac{R+L}{R-L} = \frac{\int_0^w f_{\mathrm{even}}(x)\,dx}{\int_0^w f_{\mathrm{odd}}(x)\,dx}$$

Notice that the only nonlinearity here is in the ratio, and there are no cross-terms in $f$,
since the integration is done over an argument linear in $f$. Also

$$\frac{R+L}{R-L} = \frac{1 + \frac{L}{R}}{1 - \frac{L}{R}}$$

so one is essentially looking at the value of $L/R$. This computation of even-odd parts has
a simple interpretation in terms of convolutions. Define $K_+$, $K_-$ as in Fig. (R-L).

Then

$$R + L = K_+ * f$$

$$R - L = K_- * f$$
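
A discrete sketch of the even-odd measurement may help fix the bookkeeping. The sums below stand in for the integrals over [0, w]; leaving out the sample at x = 0, the test signal `step`, and the helper names are all assumptions made for illustration.

```python
def binford_sums(f, w):
    """Discrete stand-ins for R and L: sums of f over x = 1..w and x = -1..-w.
    The sample at x = 0 is left out (an assumption; the text integrates from 0)."""
    R = sum(f(x) for x in range(1, w + 1))
    L = sum(f(-x) for x in range(1, w + 1))
    return R, L

def f_even(f, x):
    """Even part of f at x."""
    return 0.5 * (f(x) + f(-x))

def f_odd(f, x):
    """Odd part of f at x."""
    return 0.5 * (f(x) - f(-x))

# Hypothetical signal: constant background 2 plus a unit symmetrical step.
def step(x):
    return 2 + (x > 0) - (x < 0)

w = 8
R, L = binford_sums(step, w)
even_sum = sum(f_even(step, x) for x in range(1, w + 1))
odd_sum = sum(f_odd(step, x) for x in range(1, w + 1))
measure = (R + L) / (R - L)   # undefined when f is purely even (R == L)
```

Since every operation before the final division is a sum, the measurement is a ratio of two linear functionals of the signal, in line with the remark that the ratio is the only nonlinearity.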


The  nonlinear  reflection  operator 


We  made  a  somewhat  different  interpretation  of  comparing  even  and  odd  parts;  we 
compare  the  relative  sizes  of  the  even  and  odd  parts  by  considering  their  norms  as  elements 
of  a  function  space.  I.e.  in  our  formulation,  the  detector  output  is 


$$\frac{\|f_{\mathrm{even}}\|}{\|f_{\mathrm{odd}}\|}$$

where $\|\cdot\|$ is the norm coming from an inner product $(\cdot,\cdot)$. For the continuous case, this
could be the inner product on $L^2$, while for the discrete case it could be that on $\ell^2$.

We take $L^2$ as the space of all Lebesgue square integrable real-valued functions, equivalenced by the
functions of square integral 0, with the inner product $(f,g) = \int f g\,d\mu$, where $\mu$ is Lebesgue measure.
The norm $\|f\|$ is then given by $\|f\|^2 = (f,f) = \int f^2\,d\mu$. More generally, one speaks of $L^2(\mu)$, for an
appropriate measure $\mu$. If we take $\mu$ to be the discrete measure, giving the value 1 at each integer, we
get the space $\ell^2$, of square-summable sequences.


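We then compute

$$\frac{\|f_{\mathrm{even}}\|^2}{\|f_{\mathrm{odd}}\|^2}
= \frac{\|f + \bar{f}\|^2}{\|f - \bar{f}\|^2}
= \frac{\|f\|^2 + \|\bar{f}\|^2 + 2(f,\bar{f})}{\|f\|^2 + \|\bar{f}\|^2 - 2(f,\bar{f})}
= \frac{\|f\|^2 + (f,\bar{f})}{\|f\|^2 - (f,\bar{f})}
= \frac{1 + \dfrac{(f,\bar{f})}{\|f\|^2}}{1 - \dfrac{(f,\bar{f})}{\|f\|^2}}$$

since $\|f\| = \|\bar{f}\|$.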


Thus it is the relative size of the cross term that is important. The function

$$z \mapsto \frac{1+z}{1-z}$$

is monotonic in $z$. We are interested only in the relative values of the detector output,
and in practice will threshold on its value, so it is sufficient to consider only

$$\frac{(f,\bar{f})}{\|f\|^2}$$


Note that

$$|(f,\bar{f})| \le \|f\|^2$$

and

$$(f,\bar{f}) = \begin{cases} \|f\|^2 & \text{if } f \text{ is even} \\ -\|f\|^2 & \text{if } f \text{ is odd} \end{cases}$$
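
The identities above are easy to verify numerically in the discrete 1-dimensional case. The following is a sketch under the assumption that the signal is sampled at x = -w..w, so that reversing the sample list implements the reflection; the function names are illustrative only.

```python
def reflection_measure(samples):
    """samples: f(x) for x = -w..w (odd length, middle element is the origin).
    Returns z = (f, f_bar) / ||f||^2, where f_bar(x) = f(-x)."""
    norm_sq = sum(v * v for v in samples)
    cross = sum(a * b for a, b in zip(samples, reversed(samples)))
    return cross / norm_sq

def even_odd_norm_ratio_sq(samples):
    """||f_even||^2 / ||f_odd||^2 computed directly from the definitions.
    Raises ZeroDivisionError for a purely even signal."""
    rev = samples[::-1]
    even = [(a + b) / 2 for a, b in zip(samples, rev)]
    odd = [(a - b) / 2 for a, b in zip(samples, rev)]
    return sum(v * v for v in even) / sum(v * v for v in odd)
```

For a general signal the two quantities are related by ratio = (1 + z) / (1 - z), so thresholding on z alone orders detector outputs in the same way as thresholding on the norm ratio.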


The higher dimensional generalization of $\bar{f}$ we are interested in is reflection across a
hyperplane; in 2 dimensions this is reflection across a line. This can be defined by choosing
some coordinate system $(x,y)$ and defining $\bar{f}$ by

$$\bar{f}(x,y) = f(-x,y)$$

so that we are looking at the odd and even parts along the $x$-axis. Instead of reflection,
we could have generalized instead to inversion, defined by

$$\bar{f}(x,y) = f(-x,-y)$$


This  could  be  interpreted  as  looking  at  the  odd  and  even  parts  across  all  lines  through 
the  origin  at  once. 

Groups  and  families  of  quadratic  operators 

Using  the  notation  introduced  in  our  section  3.2.2  on  parametric  convolutions,  these 
operators are of the form

$$\Psi_g : \mathcal{F}(\mathbb{R}^n) \to \mathbb{R}$$

where $g$ is an isometry of $\mathbb{R}^n$. Since this operation is to be performed at every point of
the image, we can parametrize it by a shift as

$$\Psi_g : \mathcal{F}(\mathbb{R}^n) \to \mathcal{F}(\mathbb{R}^n)$$

$$\Psi_g(f)(x) = (f,\; T_x T_g T_{-x}(f))$$
This is roughly the same as our definition of parametrized convolution, except that the
fixed convolution kernel $K$ is replaced by the function $f$ itself, giving an operator which
is nonlinear in $f$. For $n = 1$, inversion and reflection are one and the same. For $n = 2$,
we have chosen reflection for the group element $g$. $\Psi_g$ can be thought of as a machine
which takes an image as input and gives as output another image, whose value at each
point is a measure of the invariance of the input under the symmetry $g$ applied at that
point. For example, suppose $g$ is a translation. Then since $T_g, T_x, T_{-x}$ all commute, the
value of $\Psi_g(f)$ will not depend on $x$, and $\Psi_g(f)$ will be the constant function with value
$(f, T_g(f))$. If we now let $g$ range over all translations, we get a function on the translation
group, viz. the autocorrelation function of $f$. Now let $g$ be reflection across the line $\ell$
through the origin. $\Psi_g$ takes an input function, and produces an output function. To
find the value of the output function at a point $x$, translate the input function so that $x$ is
at the origin, transform the input function by $g$ (i.e. reflect across $\ell$), and translate back
to $x$, then take the inner product with the untransformed input function. That is
the value of the output at $x$. Usually, we are interested in doing this for local support,
i.e. the result should only depend on a neighborhood of each point, or be weighted near
the point. We can build this into the inner product by using a suitable measure, so that
this situation is still described by the same formalism, except it is now more convenient
to write

$$\Psi_g : \mathcal{F}(\mathbb{R}^n) \to \mathcal{F}(\mathbb{R}^n)$$

$$\Psi_g(f)(x) = (T_{-x}(f),\; T_g T_{-x}(f))$$

which  amounts  to  taking  the  inner  product  at  the  origin,  rather  than  first  translating 
back to the point of interest. Since translation is an isometry, this does not affect the
value of the unweighted inner product, while for a locally weighted operator, the inner
product is just defined once, at the origin.
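
As an illustration of the locally supported form, the sketch below evaluates the reflection operator at one pixel of a small image, reflecting across the vertical line through that pixel. The square window, uniform weighting, skipping of the fixed column, and the name `psi_reflection` are assumptions for this example, not a description of the implementation reported later.

```python
def psi_reflection(image, row, col, h):
    """Reflection measure at pixel (row, col) over a (2h+1)-square window.
    Returns (cross, norm_sq): the inner product of the window with its mirror
    image across the vertical line through the pixel, and the squared norm of
    the window. The column on the line of reflection (a fixed set) is skipped."""
    cross = 0.0
    norm_sq = 0.0
    for dr in range(-h, h + 1):
        for dc in range(1, h + 1):
            a = image[row + dr][col + dc]
            b = image[row + dr][col - dc]
            cross += 2 * a * b        # each mirrored pair enters the inner product twice
            norm_sq += a * a + b * b
    return cross, norm_sq

n = 9
# Odd about the column c = 4 (an idealized laterally inhibited vertical edge).
vertical_edge = [[(c > 4) - (c < 4) for c in range(n)] for r in range(n)]
# Same edge but with polarity alternating from row to row.
alternating = [[(-1) ** r * ((c > 4) - (c < 4)) for c in range(n)] for r in range(n)]
# An edge at right angles to the first one.
horizontal_edge = [[(r > 4) - (r < 4) for c in range(n)] for r in range(n)]
```

The normalized value cross / norm_sq is -1 for both the plain and the alternating vertical edge, since each row contributes regardless of polarity, and +1 for the horizontal edge, which shows the directional character of the operator.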


Fig. ($\Psi_g$)

[Sunday 1978] has shown that $\Psi_g$ is invariant under isometry, in the sense that

$$\Psi_g(T_h f) = T_h \Psi_g(f)$$

for all isometries $h$, if and only if

$$g \in \mathrm{Center}(O(n))$$

where $O(n)$ is the group of orthogonal transformations of $\mathbb{R}^n$, i.e. $n \times n$ orthogonal matrices,
which have determinant ±1, so the condition says that $g$ must commute with all of $O(n)$. Now,
$\mathrm{Center}(O(n)) = \{I, -I\}$. I.e. the center consists only of inversion, $-I$, and of course
the identity, $I$. Notice that for 2 dimensions, inversion is the same as a rotation by 180°.
Thus, in 2 dimensions, $g$ can only be inversion for $\Psi_g$ to be unaffected by an arbitrary
isometry, except of course by being carried along with it. In particular, this means that
$\Psi_g$ is not directional:

$$\Psi_g(T_h f)(0) = T_h \Psi_g(f)(0) = \Psi_g(f)(0)$$

If we confine attention only to invariance under rotations, the situation is somewhat
different. The rotation group of $\mathbb{R}^n$, $SO(n)$, is the component of $O(n)$ with determinant
+1. Since $SO(2)$ is commutative, it is its own center. Thus, non-directional operators
in this family could be defined to measure the symmetry of rotation by some arbitrary
angle. On the other hand, reflection through a given line is not in $\mathrm{Center}(O(2))$ or in the
center of $SO(2)$ in $O(2)$ (i.e., all those elements of $O(2)$ which commute with all of $SO(2)$);
and  the  operator  it  induces  is  therefore  not  invariant  under  isometry  or  rotation,  as 
should  be  clear  after  some  reflection.  The  reflection  operator  is  a  directional  operator, 
and  must  be  applied  for  a  family  of  lines  of  reflection. 

[Sunday 1978] has used essentially the operator above, with $g$ = inversion, for binary
pictures as a symmetry detector. For that case, it is interesting that

$$\Psi_{-I}(f)(x) = (f * f)(2x)$$

We note in passing that this could be considered as a 2nd order term in a Taylor expansion
in  the  Fourier  domain. 

Noise  performance 

Linear  operators  mitigate  noise  by  averaging,  thus  reducing  the  normalized  variance. 
The nonlinear operators of the class defined above also exploit the correlation properties
of the noise more directly. In the presence of noise, these operators contain a linear
and a quadratic noise term. The linear term behaves essentially like that of a linear
operator, though it is signal-dependent. It is often assumed that the noise is Gaussian
and uncorrelated. In that case, the quadratic noise term vanishes. (I.e., on the assumption
that the noise is uncorrelated under the group action involved, its contribution vanishes
except on the fixed points of $g$. In the continuous case, this is a set of measure 0. In the
discrete case, this set may not be of measure 0, so care should be taken to avoid including
the fixed part, e.g. the line of reflection. It only adds to the noise, and measures nothing
of  interest.  If  there  has  been  a  preprocessing  step,  such  as  lateral  inhibition,  additional 
care  must  be  taken,  because  not  all  output  terms  will  be  uncorrelated.)  Thus,  while  the 
nonlinearity  in  the  signal  term  gives  a  quadratic  gain,  there  is  no  comparable  contribution 
from  the  noise.  Furthermore,  the  linear  noise  term  is  scaled  by  the  signal,  providing 
additional  noise  immunity. 
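
A small simulation can make the claim about the quadratic noise term plausible. It is only a sketch: the noise model (independent unit-variance Gaussian samples), the sample count, the seed, and the name `cross_and_norm` are assumptions, and the result is a single random draw rather than a proof.

```python
import random

def cross_and_norm(samples):
    """(n, n_bar) and ||n||^2 for samples at x = -w..w under reflection x -> -x,
    skipping the fixed point x = 0 as the text recommends."""
    w = len(samples) // 2
    cross = sum(2 * samples[w + k] * samples[w - k] for k in range(1, w + 1))
    norm_sq = sum(v * v for i, v in enumerate(samples) if i != w)
    return cross, norm_sq

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(20001)]
cross, norm_sq = cross_and_norm(noise)
ratio = cross / norm_sq   # quadratic noise term relative to the noise energy
```

For uncorrelated samples the cross term averages to zero while the energy grows with the support, so `ratio` shrinks roughly like the inverse square root of the number of samples.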

Implementation 

The nonlinear reflection operator was implemented for a support size of approximately 100 pixels,
with uniform weighting. The image was first convolved with a difference of boxes lateral
inhibition operator of similar dimensions, with central region of typically 9 pixels. The
reflection operator was used only for detection, using a threshold which was set based
on a global estimate of noise. The results were qualitatively better than those obtained
for various linear detection predicates based on difference of boxes. With tight coding,
including automatic compilation of in-line machine code for each operator at run time,
thus avoiding any subscript computations later, the cost was essentially the same as for
a linear operator of comparable support and nontrivial coefficients.

Planar Fit Edge Location

Applying lateral inhibition [Binford 1981] to a perfect step edge results in a central planar
region  whose  zero  crossing  line  corresponds  to  the  edge  locus.  We  implemented  an  edge 
location  operator  which  solves  for  this  zero  crossing  by  finding  the  parameters  of  the 
approximating  plane  in  the  appropriate  region. 

Let $L$ be a lateral inhibition operator, $f$ the input picture, and $L(f)$ the result of lateral
inhibition. Define $r,s$ to be the (discrete) coordinate functions in the $i,j$ directions. I.e.,

$$r, s : \mathbb{Z}^2 \to \mathbb{Z}$$

$$r : (i,j) \mapsto i$$

$$s : (i,j) \mapsto j$$

Then the problem of fitting a plane to $L(f)$ in some neighborhood can be thought of as
finding $u,v,w \in \mathbb{R}$ such that $\epsilon$ is minimized in the expression

$$L(f) = ur + vs + w + \epsilon$$

Since we are using the $\ell^2$ norm with the standard inner product, minimizing $\epsilon$ in the
least squares sense is the same as minimizing $\epsilon \cdot \epsilon$, which happens if we determine $u,v,w$
by orthogonal projection of $L(f)$ onto the subspace of $\ell^2(\mathbb{Z}^2)$ spanned by the functions
$r, s, 1$, where 1 is the constant function. Since we are interested in local fitting, i.e.,
fitting the central planar region discussed above, the functions $r, s, 1$ must be taken as
the restrictions to the region of interest. If this region is symmetrical about the origin,
it's easy to see that $r, s, 1$ are all mutually orthogonal, so that the parameters $u, v, w$ are
easily  found  as 
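
For mutually orthogonal $r, s, 1$, orthogonal projection gives each parameter as a ratio of inner products, e.g. $u = (L(f), r)/(r, r)$, and similarly for $v$ and $w$. The sketch below applies this standard projection formula on a square window centered at the origin; the window shape, the dictionary layout of `values`, and the function name are assumptions for illustration.

```python
def fit_plane(values, h):
    """values[(i, j)] holds L(f) at integer offsets i, j in [-h, h].
    Returns (u, v, w) minimizing the squared residual of u*i + v*j + w.
    On a window symmetric about the origin, r, s and 1 are mutually
    orthogonal, so each coefficient is a separate inner-product ratio."""
    pts = [(i, j) for i in range(-h, h + 1) for j in range(-h, h + 1)]
    rr = sum(i * i for i, j in pts)          # (r, r)
    ss = sum(j * j for i, j in pts)          # (s, s)
    count = len(pts)                          # (1, 1)
    u = sum(values[p] * p[0] for p in pts) / rr
    v = sum(values[p] * p[1] for p in pts) / ss
    w = sum(values[p] for p in pts) / count
    return u, v, w

# Hypothetical data: samples of the exact plane 2i - j + 0.5 on a 5 x 5 window.
h = 2
values = {(i, j): 2 * i - j + 0.5 for i in range(-h, h + 1) for j in range(-h, h + 1)}
u, v, w = fit_plane(values, h)
```

The estimated edge locus is then the line u·i + v·j + w = 0, with normal direction (u, v) and distance |w| / sqrt(u² + v²) from the window center.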

structure we propose to consider is depicted in the following commutative diagram, Fig.
(*).  This  will  require  some  technical  improvements,  which  we  make  shortly,  but  this 
simpler  picture  exhibits  the  main  ideas  in  an  uncluttered  way. 


An example of functions $F_1$, $F_2$ for a neighborhood of a familiar embedding of an object $\Sigma$
is presented in Fig. (stereo pair). In this case $F_1$ and $F_2$ take their values in $\mathbb{R}^1$, which is
represented as brightness, and the geometry has been carefully controlled to assure that
features will coincide in a particularly simple way (i.e. the images are rectified). This
pair  is  best  viewed  by  holding  the  page  at  arm’s  length,  with  the  pictures  side  by  side, 
and crossing one's eyes so as to fuse the 2 images into one. This takes some practice.

We will find it easier in our analysis to think of the equivalent situation of a stationary
observer  in  a  world  which  moves.  This  situation  is  depicted  in  Fig.  (egocentric  example), 
for  a  particular  choice  of  object  and  imaging  geometry. 


Fig.  (egocentric  example) 


We can formalise the egocentric situation in a mathematical structure which covers a wide
range of situations, e.g. different imaging projections. The essence of the mathematical

6) A law which expresses the image intensity as a function of all the other characteristics
of the situation.

Fig. (world-o-centric example) is a schematic representation of a possible imaging situation.
The name indicates that we are regarding the world as stationary, while the observer
moves, which is the usual way of thinking of a stereo imaging situation. The
nomenclature is explained in detail later, and is unimportant right now.


Fig. (world-o-centric example)

The Mathematical Structure

Here is the situation we are confronted with. From a 2-dimensional image or set of
images, we want to reconstruct or at least describe the 3-dimensional object that gave
rise to our data. Furthermore, we ultimately want to identify objects independent of
viewpoint. Now if one can reconstruct the 3-dimensional object, then by brute force one
can determine whether 2 data sets (a data set might be a picture, a pair of pictures, a
sequence of pictures) correspond to the same 3-dimensional object. However, nature is not
profligate in providing us with information, so e.g., one cannot hope to reconstruct the
entire object, even in principle; and in practice, accuracy is limited. It would be helpful
to know, therefore, something about the likelihood that various data sets may have arisen
from the same object, or, more generally, from a single meaningful class of objects. In
particular, it would be helpful to know something about how different viewpoints affect
the geometry or topology of a data set. So what we have is*:

1) A surface or set of surfaces embedded in R^3.

2) A canonical map from R^3 to R^2 (or possibly S^2), the perspective projection.

3) A group of transformations of R^3, viz. the rigid motions of R^3, which correspond
isomorphically to the possible ways of viewing an object in R^3.

4) A function defined on the surface, which comprises the intrinsic surface characteristics
(e.g. reflectivity).

5) A function defined on R^3, expressing the illumination (which may depend on the
embedding and intrinsic surface characteristics as well).

*The mathematical notation and some related definitions are reviewed in a fine print section in a few
pages. For the moment, it may be helpful to know that R^n is n-dimensional Euclidean space; and S^n is
the n-dimensional sphere, so S^2 is the 2-dimensional sphere - the surface of a solid 3-dimensional ball.

derivative at 6 “typical”* points in the picture is just enough. That is for a monochrome
picture; in the case of color we show that 3 such points give the same information.

In between the beginning and end of the chapter, there is a middle, in which we apply
some differential topology to study invariant structures in pictures. We mainly focus
attention on the structure of level sets, the picture loci of each intensity value. These
have an invariant tree structure with simple properties given by Morse theory (part of
differential topology). We also consider the behavior of the tree in the presence of noise,
which again is well understood, and propose the structure as a good starting point for
stereo matching. In a later section, we show how the scale space paradigm is one way of
exploring the structure of the level set tree, and we argue that the invariant structure we
propose, including the noise and deformation behavior, is a more complete model for the
scale structure of the image, yet it requires no convolutions.

Along the way, we introduce some ideas we need from differential topology, with an eye
to explaining their significance in our context of vision. A central idea is genericity,
a rigorous definition of “typical,” which allows us to ignore the problems of special or
pathological cases. Without this, our theorems would be impossible, as there would be
an endless series of special cases and exceptions to dispose of; instead we can focus on
the interesting cases that occur “typically.”

territory of differential topology. When we add in the group of rigid motions, we have
differential geometry.

We apply topological methods to study the correspondence problem of stereo vision,
which seeks to find corresponding points in 2 pictures taken from different viewpoints; i.e.
matching a point in one picture with the unique point in the other picture that came from
the same point on the object (if indeed such a point exists). We begin by assuming that
we know nothing of the distortion between the 2 pictures. If we can find this distortion,
then we will have solved the correspondence problem. What we find is the novel result
(the Two Color Theorem) that this problem is degenerate for monochrome pictures,
but uniquely soluble for color pictures (of 2 or more color dimensions). This means
that in the monochrome case, the distortion cannot be found without making additional
constraints which depend on the properties of the rigid motion group (i.e. the geometry),
the projections (including optics), and the possible relation between the viewpoints. On
the other hand, for color pictures, we can ignore the geometrical information, or more
practically, consider it independently in an overdetermined system. We also extend these
results to the situation where the contrast and absolute intensity scale of the pictures
may vary, and we consider some of the effects of noise.

At the end of the chapter we return to the geometry which played only a minor role in
the proof of the Two Color Theorem, the geometry which we now exploit to analyze the
motion problem, the differential analog of the stereo problem. We take the view that
our data consists only of pointwise color values in the picture. Since the picture varies
with time, we also have pointwise derivative values. Unlike most previous work, we do
not assume that we know how any individual points in the image are actually moving
(the analog of correspondence), nor do we seek to find that motion as an intermediary
to the spatial motion. The question we address, then, is how much of this instantaneous
pointwise data does it take to uniquely specify the motion in space. We apply Lie
algebra methods to show that knowing the picture function, its gradient, and its first time

In 1872, Felix Klein was admitted to the faculty of the University of Erlangen. On
this occasion he was required to give an inaugural speech, in which he proposed a
characterization of the study of geometry, which had recently seen the introduction of
non-Euclidean geometries. His proposal came to be known as the Erlanger Programm,
and was a unifying influence on geometric thought for the next 50 years or so. The essence
of what he suggested is that a geometry should be viewed as the study of the invariants
of the action of some group. In Euclidean geometry, for example, one studies invariants
of the group of rigid motions of the plane. One can view various geometrical studies this
way, e.g. special relativity considers the invariants of the Lorentz group, while topology
studies those of groups of homeomorphisms. In the same spirit, the task of computer
vision can be viewed as finding invariants of picture functions under the rigid motion
group of 3-dimensional Euclidean space.

As  an  object  moves  in  space,  or  as  we  change  our  viewpoint,  the  projection  of  the 
object’s  points  to  the  picture  undergoes  a  deformation  which  depends  on  the  shape  of 
the  object,  the  motion,  and  the  projection.  Carried  along  with  this  deformation  is  the 
picture function, given by the color value at each point, which is a result of intrinsic
properties  of  the  solid  object,  but  which  can  depend  on  lighting  conditions  in  addition  to 
the  deformation  of  the  projection. 

Our first goal in this chapter is to make precise what are all these functions, objects,
projections, and motions, and what are their relationships; in other words, to describe
this structure in the language of modern abstract mathematics, giving us something to
attack with rigorous tools. The structure we find, of manifolds and maps, is the natural

excursions of the trajectory, as might happen in trying to maximise the integral of
a positive quantity (such as $p(h(x) \mid E(x, \dot{x}))$). In maximising a positive quantity,
lengthening the curve always increases the integral, so e.g. extending the curve always
improves things, and one can have the pathology of improving a curve by taking out a
tiny piece and replacing it by some wild excursion which accumulates more of the positive
goodness. Minimising a positive quantity (or maximising a negative one) avoids this, since
there is a shortest path, i.e. one cannot keep minimising by always shortening the path.

In summary, the outcome is that the Lagrangian picture allows us to reduce the extremal
problem to a local one. Since we can estimate $\partial L/\partial x\,(x, \dot{x})$ and $\partial L/\partial \dot{x}\,(x, \dot{x})$, we can find,
numerically at least, the trajectories that solve the Euler-Lagrange equations, and this is
based on local information. The key features making this possible are the existence of
the Lagrangian function defined on the space of $(x, \dot{x})$, and the constraint that the only
trajectories of interest in that space are those where $dx/dt = \dot{x}$.
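
To make "numerically at least" concrete, here is a minimal sketch of integrating the Euler-Lagrange equation for a one-dimensional Lagrangian, using only local estimates of the partial derivatives of L. Everything in it is an assumption made for illustration: the function names, the finite-difference step, the Runge-Kutta integrator, and above all the toy Lagrangian, which is a harmonic oscillator and not an edge Lagrangian. It also assumes the second derivative of L with respect to the velocity is nonzero.

```python
import math

def acceleration(L, x, v, h=1e-3):
    """Solve the 1-D Euler-Lagrange equation for d2x/dt2.

    Uses central finite differences of L(x, v), where v stands for dx/dt.
    Expanding d/dt(dL/dv) = L_vx * v + L_vv * a and setting it equal to L_x
    gives a = (L_x - L_vx * v) / L_vv (requires L_vv != 0).
    """
    L_x = (L(x + h, v) - L(x - h, v)) / (2 * h)
    L_vv = (L(x, v + h) - 2 * L(x, v) + L(x, v - h)) / (h * h)
    L_vx = (L(x + h, v + h) - L(x + h, v - h)
            - L(x - h, v + h) + L(x - h, v - h)) / (4 * h * h)
    return (L_x - L_vx * v) / L_vv

def integrate(L, x, v, dt, steps):
    """Classical RK4 on the first-order system dx/dt = v, dv/dt = acceleration."""
    for _ in range(steps):
        k1x, k1v = v, acceleration(L, x, v)
        k2x = v + 0.5 * dt * k1v
        k2v = acceleration(L, x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x = v + 0.5 * dt * k2v
        k3v = acceleration(L, x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x = v + dt * k3v
        k4v = acceleration(L, x + dt * k3x, v + dt * k3v)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return x, v

def toy_lagrangian(x, v):
    # Hypothetical stand-in: harmonic oscillator, whose trajectories are known.
    return 0.5 * v * v - 0.5 * x * x

# Start at x = 1 with zero velocity and integrate for a quarter period.
x_end, v_end = integrate(toy_lagrangian, 1.0, 0.0, (math.pi / 2) / 1000, 1000)
```

The initial condition supplies both the position and the velocity, which is the point made in the text: the trajectory is determined in the space of positions and velocities, not in the image plane alone.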



Since exponentiation is monotonic, extremizing the exponential is equivalent to extremizing
the exponent. This leads to a simple way to extend this to the continuum, by
generalizing the sum to an integral (as could be done for any product). The condition
then becomes one of maximising

$$\int \log p(h(x(t)) \mid E(x(t), \dot{x}(t)))\, dt$$

which is a negative quantity, or, perhaps more intuitively, of minimizing

$$-\int \log p(h(x(t)) \mid E(x(t), \dot{x}(t)))\, dt$$

I.e., we can choose $-\log p(h(x) \mid E(x, \dot{x}))$ as the Lagrangian $L(x, \dot{x})$.
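
The equivalence between maximising the product of densities and minimizing the summed negative logarithm can be checked on a small discrete example. This is only a sketch: the density values are invented for illustration and the function name `path_cost` is hypothetical.

```python
import math

def path_cost(densities):
    """Discrete stand-in for the integral of -log p along a sampled path."""
    return -sum(math.log(p) for p in densities)

# Invented per-point density values p(h(x(t)) | E(x(t), xdot(t))) for two candidate paths.
path_a = [0.9, 0.8, 0.85]
path_b = [0.9, 0.2, 0.95]

prod_a = math.prod(path_a)   # joint density under independence
prod_b = math.prod(path_b)

# The path with the larger product is the one with the smaller cost.
better_by_product = prod_a > prod_b
better_by_cost = path_cost(path_a) < path_cost(path_b)
```

Working with the sum rather than the product also avoids numerical underflow when the path has many sample points.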

Integrating the Euler-Lagrange equations requires an initial condition (or possibly a
boundary condition). Since the space in which the equations are set is the $(x, \dot{x})$ space,
the initial condition must specify both $x$ and $\dot{x}$. In general, different initial values of
$\dot{x}$ will give different trajectories. That is the price one pays for getting a completely
local problem. However, this can be readily dealt with by separately maximizing over
directions of $\dot{x}$, or choosing initial $x$ at points of high confidence (seeding). Alternately,
the phase portrait associated with the trajectories can be thought of as a “primal sketch”
of the potential global edge structure of the image. This structure directly represents
simultaneous multiple, even conflicting, interpretations. E.g. there may be more than
one value of $\dot{x}$ at some $x$ or in some neighborhood, which gives a tenable edge locus. The
orbits, i.e. global edges, have a measure assigned to them by the Lagrangian integral, so
there is a ready way to rank and prune multiple interpretations.

As long as “edgeness” does not have a canonical definition, we can't avoid a heuristic
aspect to the choice of Lagrangian. The particular extremal problem we have suggested,
however, has the nice property that the integral value cannot be improved by arbitrary

the magic of the calculus of variations, this can be reduced to a purely local condition on
the trajectories, given by the Euler-Lagrange equations: $\frac{d}{dt}(\partial L/\partial \dot{x}_i) - \partial L/\partial x_i = 0$. I.e.,
the solutions to the variational problem can be found by solving the system of equations

$$\frac{dx}{dt} = \dot{x}$$

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{x}_i}\right) - \frac{\partial L}{\partial x_i} = 0 \qquad \text{(Euler-Lagrange)}$$

Thus for our situation, all we must do is define an appropriate Lagrangian function
$L(x, \dot{x})$. Of course, this will be related to the local “edgeness” function. Typically, an
“edgeness” function is the result of applying an operation which measures the degree
to which the image locally resembles an edge. E.g., one might convolve with a family
of optimal filters, such as oriented smoothed steps; the output would be an “edgeness”
function depending on position and orientation.

We describe one candidate for such a Lagrangian function. Suppose that at each point x
of the image we have computed some information, perhaps by convolving with some set
of operators; call this information h(x). Define $E(x, \dot{x})$ to be the event that there is a local
edge of magnitude and direction $\dot{x}$ at the point x. Then with some assumptions about
the noise process we can estimate $p(h(x) \mid E(x, \dot{x}))$, the probability density that h(x)
arose as a “consequence” of $E(x, \dot{x})$. The function $p(\,\cdot \mid E(x, \dot{x}))$ is a probability density
on the space of data h(x), for each $E(x, \dot{x})$. (Note that we have no a priori estimates
for $p(E(x, \dot{x}))$, and that the entire event space need not be $\bigcup_{x, \dot{x}} E(x, \dot{x})$.) Then we can
argue that for a set of points along a contour, we want to maximize the resulting joint
probability density for all the points. Assuming independence, this becomes

$$\prod_t p(h(x(t)) \mid E(x(t), \dot{x}(t)))$$

for integer t, i.e. a finite set of points. This can be conveniently rewritten as

A Variational Principle for Edge Linking

The field of edge detection has seen no particularly successful consideration of global
shape (though see [Marimont 1984] for some recent work in that direction). One can try
to find global edge contours either by solving for global information directly (e.g. finding
a level set of some function), or by piecing together data from simple local operators,
as [Montanari 1970, Montanari 1971, Martelli 1972, Martelli 1973] did. Here we offer a
variational approach for the latter kind of contour finding.

The  essence  is  the  observation  that  there  is  a  formal  similarity  between  optimal  edge 
linking  and  Lagrangian  mechanics. 


Consider a trajectory γ : I → R^2 in the image, which we think of as an edge locus. We
can represent a local ideal step edge at each point γ(t) of γ(I) as a vector whose direction
and magnitude represent those of the edge. (By magnitude of edge, we mean the size of
the step.) This establishes a correspondence between trajectories in the plane and edge
loci.


In the Lagrangian picture of mechanics, the state space (phase space) is a 2n-dimensional
space of 2n-tuples $(x_1, \ldots, x_n, \dot{x}_1, \ldots, \dot{x}_n)$, and the trajectories through this space must
extremize the integral $\int L(x, \dot{x})\, dt$ and satisfy the constraint that $\dot{x} = dx/dt$. Through

to the true and precise location of the real edge giving rise to the data. Local detection
of edges is not an end in itself, but only the first step in the process of contour finding.
The process of assembling the local edges into contours will confront ambiguities where
it is not clear which, if any, contour a local edge belongs to. The coarser the resolution of
the local edge parameters (e.g. position, orientation) the more frequently ambiguities will
arise. As long as we compute the local edge parameters as nonsingular smooth functions
of the true edge parameters, then the computed values will be samples of a continuous
function, and subpixel resolution will serve to decrease ambiguity.

to find the parameters for a vertically elongated region (with analogous operators for
other directions). Square operators could also be used.

These convolutions yield at every point p ∈ Z^2 three parameters u(p), v(p), w(p), determining
a plane given by z = ux + vy + w which is the best fit to the data L(I) in the
translated support of the convolution operators. The position and orientation of the edge,
i.e., the parameters of u_{-1}(ax + by + c), are given by finding the zero crossing line of the
fitted plane, i.e. by solving 0 = ux + vy + w.
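
As a small illustration of the last step, the zero crossing line 0 = ux + vy + w can be turned into a position and an orientation in the local coordinates of the fit. This is a sketch under stated assumptions: the function name is hypothetical, the position is reported as the point of the line nearest the local origin, and the orientation as the angle of the line's normal (the gradient direction of the fitted plane). The text itself does not fix these conventions.

```python
import math

def edge_from_plane(u, v, w):
    """Locate the zero crossing line 0 = u*x + v*y + w of a fitted plane.

    Returns (x0, y0, theta): the point of the line closest to the local
    origin and the angle of the line's normal, or None when the plane is
    flat (u = v = 0) and has no well-defined zero crossing line.
    """
    g2 = u * u + v * v
    if g2 == 0.0:
        return None
    x0 = -w * u / g2
    y0 = -w * v / g2
    theta = math.atan2(v, u)
    return x0, y0, theta

# A plane rising in x and crossing zero half a pixel to the right of the origin.
x0, y0, theta = edge_from_plane(2.0, 0.0, -1.0)
```

Because x0 and y0 are real-valued, the edge locus is obtained to subpixel precision directly from the three fitted parameters.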

This  operator  gave  qualitatively  good  results  for  location  and  direction  of  edges  in 
numerous  real  pictures. 

Subpixel  localisation 

The  zero  crossing  parameters  found  by  the  above  method  give  an  edge  locus  to  subpixel 
precision.  For  an  ideal  edge,  with  sufficiently  low  noise,  this  is  an  accurate  estimate.  Real 
edges  are  not  ideal,  and  it  would  be  quite  fortuitous  if  the  nonideality  occurred  in  just 
such a way as to make the subpixel estimate accurate. Nevertheless, making such an
approximation for subpixel location is useful, even without knowing that it corresponds

Of course, this is only good for a region centered at the origin. For a region with arbitrary
center, we can apply the same technique modulo a translation to the origin. Equivalently,
since we are talking about a family of regions congruent under translation, we can consider
u, v, w to be functions on Z^2 expressing the parameters of the plane fit in local coordinates
centered at their argument. Then we have


$$u = \frac{r * L(I)}{r \cdot r}, \qquad v = \frac{s * L(I)}{s \cdot s}, \qquad w = \frac{1 * L(I)}{1 \cdot 1}$$

This permits us to implement the least squares fit as convolutions with the functions
r, s, 1.
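
The projection onto r, s, 1 can be sketched for a single window as follows. The assumptions here are mine: r and s are taken to be the x and y coordinate functions on a rectangular window centered at the origin, the window size is arbitrary, and NumPy is used for the sums. The data is a synthetic exact plane, not L(I) of a real picture.

```python
import numpy as np

def plane_fit_masks(half_h, half_w):
    """Masks r, s, 1 on a (2*half_h+1) x (2*half_w+1) window centered at the origin."""
    ys, xs = np.mgrid[-half_h:half_h + 1, -half_w:half_w + 1]
    r = xs.astype(float)      # x-coordinate function (assumed meaning of r)
    s = ys.astype(float)      # y-coordinate function (assumed meaning of s)
    one = np.ones_like(r)     # the constant function 1
    return r, s, one

def fit_plane(window, r, s, one):
    """Least-squares plane z = u*x + v*y + w by orthogonal projection onto r, s, 1."""
    u = np.sum(r * window) / np.sum(r * r)
    v = np.sum(s * window) / np.sum(s * s)
    w = np.sum(one * window) / np.sum(one * one)
    return u, v, w

r, s, one = plane_fit_masks(2, 1)      # 5 x 3, a vertically elongated window
data = 0.5 * r - 2.0 * s + 3.0         # synthetic exact plane
u, v, w = fit_plane(data, r, s, one)
```

Sliding the window over the picture turns each sum into a correlation of the data with the corresponding mask. Since r and s are odd functions, a true convolution with them differs from this correlation only by a sign, which an implementation has to account for.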

For example, using the lateral inhibition kernel

we can use r, s, 1 masks

Although the diagram Fig. (*) may at first appear rather formidable to the non-mathematician, it is
actually fairly simple. Nevertheless, here is a detailed explanation.

The diagram represents the relationships among various maps between various spaces. To be exact,
the symbols at the nodes represent spaces, the arrows represent maps (i.e. functions) from one space
to another, and the symbols along the arrows name the maps. Sometimes a map can also be thought
of as a point in some other space, but that is not represented in the diagram. Saying the diagram is
commutative means that any path along arrows (concatenated by composition) joining two spaces gives
the same result. E.g. the following diagram


is commutative iff h = g ∘ f. Diagrams which are not commutative are generally confusing.
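
For readers who think more readily in code, the condition h = g ∘ f can be spot-checked on sample points. The three maps below are invented for illustration, and agreement on a finite sample is only a necessary condition for commutativity, not a proof.

```python
def f(x):
    return x + 1

def g(y):
    return 2 * y

def h(x):
    return 2 * x + 2

def commutes(f, g, h, samples):
    """True if g(f(x)) equals h(x) at every sample point."""
    return all(g(f(x)) == h(x) for x in samples)

triangle_ok = commutes(f, g, h, range(-5, 6))
triangle_bad = commutes(f, g, lambda x: 2 * x, range(-5, 6))
```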

The problem of defining or representing the general surface in spaces of dim ≥ 3 is not trivial. That is
because the surface may have a strange configuration, e.g. it may close on itself like the sphere or torus,
or it may wind around itself.

The simplest examples of surfaces are given by equations of the form z = f(x, y), i.e. as a map
f : R^2 → R which we can interpret as assigning to each point on the plane a height above the plane.

There is at present a divergence between the mathematical literature on the one hand, and the engineering
and scientific literature on the other, with respect to the notation used to represent functions. This is a
divergence which has developed during the past several decades, primarily because of the mathematicians'
realisations of the necessity of making explicit the existence of a function (or map) as an object in its own
right, as well as the requirement of avoiding various ambiguities which otherwise arise. In the engineering
literature one often sees references, e.g., to

“a function z = f(x, y)”

For many purposes, it is clear enough what this means. However, to be precise and avoid confusions we
will adhere to the following notations.

f : X → Y

will mean that f is a function which maps points in the space X to points in the space Y. (By function,
we mean a single-valued function, or an assignment rule.) Additionally, the notation

f : x ↦ y

f : X → Y
    x ↦ y

will mean that f takes the point x ∈ X to the point y ∈ Y, which we will also write as

y = f(x)

Note the difference between, e.g. x and X, and especially the different meanings of the 2 types of arrows.
In this notation, rather than saying

“the function z = f(x, y)”


we will say

f : R^2 → R
    (x, y) ↦ z

(Here R^n is n-dimensional Euclidean space). Here the function is f, which is a map from R^2 to R.
f(x, y) is the value of the function f at the point (x, y) ∈ R^2. Notice that we might have said, e.g.

(*.  V,*)*-*9 

(S^n is the n-dimensional sphere given by 1 = Σ_i x_i^2, where the x_i are coordinates in R^{n+1}. S^2 is
the 2-sphere (homeomorphic to the surface of a ball) and S^1 is the circle.)

We can think of this as a surface by considering the points of the surface as given by the graph of f,
i.e. by {(x, y, z) ∈ R^3 | z = f(x, y)}. We can describe the surface as a function f̃ : R^2 → R^3 given by
the formula f̃(x, y) = (x, y, f(x, y)). Unfortunately, most surfaces, e.g. the sphere, cannot be described
this way. Most importantly, no matter where we place the plane, there are usually either 2 points or
0 points of the sphere above any point on the plane. One can remedy this by defining the surface as
{(x, y, z) | g(x, y, z) = 0} for an appropriate function g. E.g., for a sphere of radius r one would take
g(x, y, z) = x^2 + y^2 + z^2 − r^2. It turns out that one can essentially get all surfaces this way, but there
are unpleasant side effects which cause us to avoid this definition. To guarantee a meaningful concept
of dimension, we would have to impose extra conditions. Besides, finding the set of points that make up
the surface is hard. Instead, we define a surface by observing that a little piece of it is very much like a
little piece of the plane. We define a patch of the surface Σ ⊂ R^3 to be a smooth 1-1 map φ : U^2 → R^3,
where U^2 ⊂ R^2 is a neighborhood (i.e. an open set) in R^2, such that φ^{-1} is also smooth:




Fig.  (patch) 

A surface is then defined to be a collection of such patches such that for any 2 patches (φ, U), (ψ, V),
φ^{-1}(φ(U) ∩ ψ(V)) is an open set in R^2. This guarantees that our object is uniformly 2-dimensional and
doesn't have self-intersections. Such a patch is often also called a chart in analogy to the charts of the
earth's sphere used by mariners. Similarly, a compatible collection of such charts, covering some object,
is known as an atlas.
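
A concrete patch may help. The sketch below uses inverse stereographic projection as a chart covering the unit sphere minus its north pole; this particular chart is my choice of example and does not appear in the text. A second chart (e.g. projecting from the south pole) would be needed to cover the whole sphere, which is why an atlas is a collection of charts.

```python
import math

def patch(u, v):
    """Inverse stereographic projection: the plane R^2 onto the unit sphere minus (0, 0, 1)."""
    d = u * u + v * v + 1.0
    return (2.0 * u / d, 2.0 * v / d, (u * u + v * v - 1.0) / d)

def patch_inverse(x, y, z):
    """Stereographic projection from the north pole back to the plane (requires z != 1)."""
    return (x / (1.0 - z), y / (1.0 - z))

point = patch(0.3, -1.2)
radius = math.sqrt(sum(c * c for c in point))   # should be 1: the image lies on the sphere
back = patch_inverse(*point)                    # should recover (0.3, -1.2)
```

Both maps are given by rational formulas with nonvanishing denominators on their domains, so the patch and its inverse are smooth, as the definition requires.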

By smooth we mean continuously differentiable some number of times. In particular, we use the notation
C^0 to represent the class of 0-times continuously differentiable functions, i.e. continuous functions; the
notation C^k for k-times continuously differentiable functions; C^∞ for infinitely differentiable functions;
and C^ω for analytic functions, i.e. infinitely differentiable functions representable by a Taylor series.
Usually smooth means C^∞, but sometimes it can mean C^k for some (finite) k. Usually it is immaterial,
but if it matters it will be stated explicitly.

A homeomorphism is a 1-1 continuous map with a continuous inverse. A C^r diffeomorphism is a
C^r homeomorphism with a C^r inverse. Often we will not specify the degree of smoothness of a
diffeomorphism, as it may not be important or it may be clear from context. Nearly everything we
consider can be thought of as C^∞, and we will state when this is not so.

Fig. (*) is meant to capture some of the basic features of imaging geometry. The object
surface we are looking at, Σ, sits in 3-dimensional space. That 3-dimensional space has
a standard projection, e.g. perspective projection, to the image plane, a 2-dimensional
space. Our data, i.e. the picture, is not the projection of the object in the image plane,
but rather a color or brightness function defined on that projection. Meanwhile, we can
change our viewpoint, or the object can move. We view this “egocentrically,” as we
explain shortly, and take this as a motion of the whole world while we stay put (we are
not considering relative motions of various objects); this motion is g, and the object is
carried along with it, rigidly fixed in 3-dimensional space.

Now we must consider what happens to the picture function. The complete physics of the
situation is that the observed picture changes as a function of the surface orientation, the
lighting direction, and the observer position, as embodied in the image irradiance equation,
in addition to undergoing geometric distortion. We have lumped the photometric
considerations together into a constant effect on the observed image irradiance, to keep
things as simple as possible for an initial analysis. They could readily be included by using
a sphere bundle over the surface, for example, to account for the relative positions of
observer, surface, and light. The simplifying assumption we have made, then, is that the
photometric effects of change in viewpoint are negligible in comparison to the geometric
ones. This is frequently a reasonable assumption, as is evidenced by the fact that we do
not often experience the retinal rivalry which occurs when the assumption is violated. Of
course, the other extreme occurs with specular reflection, when the photometric effects are
dominant. The consequence of this assumption is that observed data values are carried
along with the object. All matching systems we are aware of to date are also predicated
on this assumption in that they deal only with features rigidly attached to objects. Some
can mitigate some photometric effects by using features such as “edges,” but edges of
specular reflections are still disastrous.

We assume, therefore, that the picture function we detect is rigidly fixed to the object
surface. This can be thought of as associating picture point values with points on the object
surface, although these values are really derived from intrinsic surface characteristics
and the image irradiance equation. This fixed association is specified by the function F.
Then the distortion gw between images tells us how the 2 pictures F_1, F_2 are related. We
want to study the problem of finding the distortion gw and the motion g just from the
data F_1, F_2.

We now lay this out in more detail. First let's consider just a part of Fig. (*), shown in
Fig. (5).


Roughly speaking, here is what we are depicting. The surface Σ is embedded in R^3
via i. π is the imaging projection from R^3 to R^2, the image plane. F_1 is the observed
image intensity on some closed set K_1 of the image plane, and F is the intrinsic surface
“intensity” giving rise to F_1, i.e. F associates observed intensities with points on the
object Σ. (In what follows, we will assume that a change in viewpoint does not alter this
association, i.e. that the intensity we observe behaves as if it were an intrinsic surface
characteristic. This simplification is justified when the changes in viewpoint we will be
considering lead to negligible changes in the intensity associated with a given point on the
object being viewed.) This is just the standard imaging situation, slightly generalized.

To be more precise, we consider some surface embedded in R^3 as the 2-manifold Σ
embedded via the injection i. Let π : R^3 → R^2 be the standard projection onto the
first 2 factors, i.e., π : (x, y, z) ↦ (x, y), also called orthographic projection. Perspective
projection can be defined as a map

* :  )• 

Since this map has a singularity at z = −1/k, it is not defined on all of R^3. Thus
to subsume perspective projection, we have to generalize our picture slightly (but really
without changing the essence), as shown in Fig. (y).
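
The two projections can be sketched in code. The orthographic map is as defined in the text. The perspective formula is an assumption on my part: the form (x, y, z) ↦ (x/(1 + kz), y/(1 + kz)) is chosen only because it is singular exactly at z = −1/k as stated, and the formula actually intended here may differ in scaling or convention.

```python
def orthographic(p):
    """Standard projection onto the first 2 factors: (x, y, z) -> (x, y)."""
    x, y, z = p
    return (x, y)

def perspective(p, k):
    """Assumed perspective form (x, y, z) -> (x/(1+k*z), y/(1+k*z)).

    Undefined on the singular plane z = -1/k, so the domain must exclude it.
    """
    x, y, z = p
    d = 1.0 + k * z
    if d == 0.0:
        raise ValueError("point lies on the singular plane z = -1/k")
    return (x / d, y / d)

p = (2.0, 4.0, 1.0)
image_ortho = orthographic(p)
image_persp = perspective(p, 1.0)
```

With k = 0 the assumed perspective map reduces to the orthographic one, which is one way to see the two as members of a single family of imaging projections on a suitably restricted domain.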


M^3 is a fixed 3-dimensional subset of R^3, and is the domain of definition for the imaging
projection π, which maps it to M^2, the 2-dimensional image space. We require further
the physically obvious regularity condition that π be a C^∞ submersion, i.e. that its
derivative be everywhere surjective. Usually, M^2 ⊂ M^3 and π is a projection in the
sense that π^2 = π. These conditions are all satisfied for ray optics if the rays do not
intersect in the image (as might happen if there were caustics, e.g.), i.e. if an image point
corresponds to a unique ray. For orthographic projection, M^3 = R^3 and M^2 = R^2.
Alternatively, so as not to look in front and behind at the same time, M^3 could be the
upper half space of R^3. For the usual perspective projection geometry given above, M^3
can be taken to be the same upper half space. In this case, the singular plane (containing
the pinhole) is behind the film plane, M^2 = R^2. Another imaging geometry is interesting
at least theoretically, which we call spherical perspective projection. In this geometry,
the projection can be looked at in spherical coordinates as the map π : (r, θ, φ) ↦ (θ, φ).

With  our  conventions,  M3  =  R3  -  0,  M 2  =  S2,  the  unit  sphere,  and  x  is  projection 
onto  the  sphere  along  the  line  to  the  center,  0.  In  this  case,  it  is  easy  to  see,  c.g.  that  the 
space  of  orientations  of  the  camera  is  isomorphic  to  the  rotations  of  the  unit  sphere. 

S_1 and K_1 are corresponding visible regions of Σ and the image space M^2, resp. More
precisely, let S_1 ⊂ Σ be such that ι(S_1) ⊂ M^3 and π : ι(S_1) → M^2 is 1-1. Then let
K_1 = π ∘ ι(S_1). This makes all the pictured maps well-defined, and the diagram Fig.
(γ) commutative.

We assume that the surface Σ admits a function F : Σ → R^n which describes intrinsic
surface features. E.g., in the situation F : Σ → R^1 (i.e. n = 1), F can be thought
of as representing an intrinsic surface brightness or luminance. Thus we are presently
ignoring the effect of viewpoint on image irradiance, or, put another way, we are taking
the reflectance function to be constant. To the extent that we deal only with small
changes in viewpoint, that will usually be a good approximation. One can enlarge the
analysis to include an image irradiance equation, but only with added complexity, so we
do not consider this here. If one wishes, F can be thought of as the intrinsic surface
property albedo, and one can assume that our analysis deals with quantities that depend only
on albedo, to good approximation. For the case n ≥ 2, we have in mind color images:
normal human cone vision incorporates a function F_1 : K_1 → R^3 (n = 3). Note that
we also subsume cases for a smaller (i.e. n = 2) or larger (4 ≤ n < ∞) number of
passbands, or in fact any surface attribute, such as a multi-dimensional texture measure,
which can be thought of as taking pointwise values in some real vector space.

We now make precise the imaging geometry which gives us the observed image F_1 from
the intrinsic surface function F. Basically, we want to say that we see the frontmost
surface of Σ (given ι and π, i.e.). This may not take up all of the image plane, and e.g. if Σ
is compact then its picture, π ∘ ι(Σ), will also be compact. Since π is a submersion, π^{-1}(p)
(for p ∈ M^2) is always locally 1-dimensional. We assume that our imaging projection is
sufficiently simple that π^{-1}(p) is not a circle; this is true if we assume light travels in
straight lines, e.g.

Here is an example of a map π : M^3 → M^2 which conforms to all the requirements we have made until
now, but for which π^{-1}(p) is a circle. Let M^3 be the solid torus S^1 × D^2, where D^2 is the unit disk. Let
π simply be projection onto the 2nd factor, i.e. π : (θ, p) ↦ p. The situation is illustrated in Fig. (torus).


Fig. (torus)


All our regularity assumptions are clearly satisfied. And π^{-1}(p) ≅ S^1.

We assume further that M^3 and M^2 can be embedded in a product structure such that
π is projection on one of the factors (we have already assumed that the other factor is a
subset of the line). I.e., we assume there is some manifold A and embeddings e_1, e_2 which
make the following diagram commutative:

Fig. (prod)


Since we have excluded the circle, π^{-1}(p) has a well defined order induced by the usual
ordering on the line, so we can define the closest point of some set on any ray as a least
element, so long as the intersection with the ray is closed. In fact it is, since the singular
set is closed by virtue of being the inverse image of a closed set, while the inverse image of
a regular value is closed, since it is a submanifold. Using regularity, the ordering can be
extended to a submanifold (with boundary). (The underlying theory is presented later.)
Incidentally, the singular set of π ∘ ι is also called the silhouette of Σ, since it comprises
the points of tangency of the line of sight to the embedding of Σ.

We are now ready to discuss the more involved situation of Fig. (*). For the same reasons
that we used Fig. (γ), we will replace Fig. (*) with Fig. (*'):

The new feature in this picture (beyond Fig. (γ)) is the effect of change in viewpoint.
A change in viewpoint means that the imaging projection π changes. Let π_0 be the
projection for viewpoint v_0 and π_1 that for v_1, where we loosely define a viewpoint as a
location, direction, and orientation (we might tilt our head) of looking. Then π_1 is just
π_0 preceded by a change of coordinates. I.e., π_1 = π_0 ∘ g, where g : R^3 → R^3. In other
words we can describe the change in π using an egocentric view where π is constant, but
the world moves. The world-o-centric picture is:


Fig. (world-o-centric)

This is really just a contraction of the egocentric Fig. (*), so we will use the egocentric
model, since it is easier to make things explicit that way. The map g : R^3 → R^3 is the
coordinate change in R^3 (the ambient space in which our objects are embedded). In fact,
since we are restricting ourselves only to coordinate changes resulting from a change
in viewpoint, we do not want to alter any metric properties, i.e. we want to preserve
geometry, so a little thought should persuade one that the possible coordinate changes g
are only the orientation preserving isometries of R^3, also known as the rigid motions of
R^3, or the Euclidean group E(3). After we have applied the motion g, the new embedding
of Σ is given by ι_g = g ∘ ι; this just says that the embedded surface got carried along
with the motion.
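
A minimal sketch of the egocentric description, assuming orthographic projection for π_0 and, purely for illustration, a rigid motion made of a rotation about the z-axis followed by a translation (a special case of E(3); the helper names are hypothetical):

```python
import math


def rigid_motion(angle, translation):
    """Return g: rotate about the z-axis by `angle`, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation

    def g(p):
        x, y, z = p
        return (c * x - s * y + tx, s * x + c * y + ty, z + tz)

    return g


def pi_0(p):
    """Orthographic imaging projection for the reference viewpoint."""
    return (p[0], p[1])


def distance(p, q):
    """Euclidean distance between two points of R^3."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def change_viewpoint(g):
    """The projection for the new viewpoint: pi_1 = pi_0 composed with g."""
    return lambda p: pi_0(g(p))
```

Here the camera map `pi_0` never changes; moving the viewpoint is expressed entirely by moving the world with `g`, and `g` preserves distances between points, as a rigid motion should.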

We now have to take some care with the definitions of K_1, K_g, S_1, and S_g. If we were
to use just the definitions of 2 copies of Fig. (γ) pasted together, g_π might not be well
defined, since we could not be sure that S_1 ⊂ S_g, or equivalently that π ∘ ι_g : Σ → M^2
is 1-1 on S_1, since for many surfaces Σ, different viewpoints g have different domains of
visibility of Σ. This is a fancy way of saying that part of what we saw in the picture F_1
might be hidden from view when we look after doing g. Hence the regions K_1, K_g must be
chosen in such a way that g_π is well-defined. For example, having chosen S_1, S_g as above,
we can define S'_1 = S'_g = S_1 ∩ S_g and K'_1 = π ∘ ι(S'_1) and K'_g = π ∘ ι_g(S'_g). With these
restrictions, g_π is a diffeomorphism K_1 → K_g with the property that F_1(p) = F_g(q) =
F_g(g_π(p)), which is the same as saying that g_π is a deformation of the picture F_1 into the
picture F_g. Note that this observation is also equivalent to asserting that the diagram is
commutative (for the F_1, g_π, F_g loop).

Occlusion, the obscuration of one part of a surface by another, occurs at the singularities
of the mappings π ∘ ι and π ∘ ι_g. Self-occlusion can occur for many objects, and if we
allow Σ to have more than 1 connected component, we are able to subsume all cases of
occlusion. The complicating feature then becomes that the domains of smoothness of F_1,
F_g are bounded by the singular sets, and an important problem then is to understand the




singularities. We have not considered this problem here, but the topology involved has
been well-studied in singularity theory and catastrophe theory (see [Arnold 1984, Arnold,
V.I. 1983] for expositions of the theory by one of the grandmasters, and [Koenderink and
van Doorn 1982, Koenderink and van Doorn 1976] for some discussion in the context of
vision).

A Catalog of Applications

Now  that  we  have  established  an  abstract  description  of  the  mathematical  setting  for 
vision,  we  can  indicate  how  the  usual  computer  vision  problems  fit  into  the  structure. 

We  consider  these  problems: 

•  Area  matching  stereo 

•  General  matching 

•  Motion  stereo  and  optical  flow 

•  Feature  based  stereo 

•  Singularity  tracking 

In subsequent sections, we will prove theorems about general matching and optical flow.
The structure we have presented comprises differentiable mappings among various spaces,
and, emulating the Erlanger Programm, a group of rigid motions in 3-space. The mathematics
of these structures is differential topology and differential geometry, so we turn to
the tools of these trades for our analysis.

For general matching, the central result is that unique image matching requires at least
2 color dimensions, unless one has knowledge of imaging geometry. ([Resnikoff 1974]
studied some relations between color and geometry, but in a quite different context.) The
results for optical flow show how to exploit this knowledge, using the geometry of rigid
motions of 3-space in the form of Lie group theory. We will also discuss the topological
structure of images, and show how the invariants can be used for stereo matching as well
as image understanding via "scale space" [Witkin 1983].

Stereo

For the stereo problem, we assume that we are given 2 pictures F_1, F_g arising from
simultaneous views of a surface in R^3. There may be some constraint, or even complete
knowledge, of the viewing situations which gave rise to the images, i.e. we may have
information about the camera model. We want to find the topography of the surface.

It should be evident that the situation is exactly that of Fig. (*), with the restriction
that K_1 and K_g are projections from the single surface Σ (we could moot the restriction
by allowing Σ to have more than 1 component; however that creates complications, and
we consider the simpler case). Then

Given

    F_1, F_g    pictures

We want to find

    ι      surface embedding
    g_π    picture correspondence
    F      intrinsic surface characteristic

Examples of possible viewpoint constraints are:

    1) g ∈ E(3) given
           camera model completely specified

    2) g = g_t for some t ∈ R, where g_t is defined by γ : R → E(3),
       g_t(p) = γ(t)(p), p ∈ R^3
           viewpoints on a 1-parameter subset of E(3)

    3) in addition, g_{s+t} = g_s ∘ g_t
           viewpoints on a 1-parameter subgroup of E(3)

    4) g_t leaves each of a space-filling set of parallel planes
       invariant for each t
           viewpoints are combinations of translations and rotations
           such that epipolar lines are well-defined

a point at which all the partials of f vanish. The x_i here are coordinate functions on M^m, defined for
some patch φ (we omit the precise definition, which can be found in any differentiable manifold book).

[...] the only boundaryless 1-manifolds are the diffeomorphs of R^1 and S^1 (the circle)
[Milnor 1965], so in a region where f has only regular values, our picture is essentially
correct. I.e., the level set corresponding to a regular value must be a 1-manifold. Now
we need to know that almost all values are regular; then since each value determines a
level set, almost all level sets will be 1-manifolds as we are claiming. But so far, we don't
even know that there has to be any region (i.e. neighborhood) free of critical values. In
fact, if F_1 is a constant map, then clearly all of U_1 consists of critical points.

Theorem (Sard) Let f : M^m → M^n be a C^k mapping between the m, n-dimensional
manifolds where k > max(m − n, 0) (for a monochrome picture, m = 2 and n = 1, so this
means k > 1, i.e. f is at least twice continuously differentiable). Then the Lebesgue
measure of the set of critical values (in M^n) is 0.

Definition In a measure space, almost all means all but a set of measure 0.
Remark In a probability space, almost all is equivalent to with probability 1.

Sard's theorem says, in other words, that almost all values are regular, which is the same
as claims 1) and 2) above. Note that it is the critical values that are of measure 0, not
the critical points. Thus, for us, this means that the set of intensity values (but not
necessarily picture points) taken at critical points (where a level set is not a 1-manifold)
is sparse. It could still be dense however, e.g. if there were critical values at all the
rationals.

3) Typically, pictures have isolated critical points (i.e. they do not form blobs, lines, or
accumulations)

Fortunately, we can say more. There are certain nasty types of critical points called
degenerate and nice ones called nondegenerate (we'll define them in a moment). One
of the nice things about nondegenerate critical points is that they are isolated, i.e. a
nondegenerate critical point has some neighborhood which contains no other critical

The Jacobian of f is really defined with respect to some pair of coordinate systems on M^m and M^n. Let
a patch of the manifold M^m be a smooth 1-1 map φ : U^m → M^m, where U^m ⊂ R^m is a neighborhood in
R^m, such that φ^{-1} is also smooth. (This is just like our definition of a patch of a surface earlier.)

Suppose we're interested in the Jacobian at a point x ∈ M^m. Then let ψ be a similarly defined patch
in M^n such that f(x) ∈ ψ(U^n), i.e. so that f puts x into the right region for ψ. Then ψ^{-1} ∘ f ∘ φ is a
map R^m → R^n, and we can speak of the classical Jacobian of this map, defined as follows. The Jacobian
matrix of F : R^m → R^n at p is the matrix of derivatives ∂F_i/∂x_j, i.e.

    J_p(F) = ( ∂F_i/∂x_j (p) ),    i = 1, ..., n,   j = 1, ..., m.

Although the Jacobian itself depends on the coordinate systems chosen for M^m and M^n, its rank does
not (see e.g. [Golubitsky and Guillemin 1973]). Thus we can speak of the rank of the Jacobian of f above,
even though by the definition we just presented, the Jacobian itself depends on the particular coordinate
charts. One can also give a coordinate-free definition of Jacobian, where the Jacobian of f is the derivative
of f, a map between tangent spaces, and then the Jacobian of f is a unique, well-defined object. The
interested reader can find the details in any book explaining differentiable manifolds, e.g. [Abraham and
Marsden 1978, Golubitsky and Guillemin 1973, Guillemin and Pollack 1974, Hirsch 1976]. The Jacobian
is nothing more than the linear approximation to the map; or it can be thought of as the linear term
in the Taylor series, which is the same thing. Thus it gives information on what the map does to the
degrees of freedom in the domain space.

Here are some related definitions:

Definition. Let f : M^m → M^n be C^1.

1) p ∈ M^m is a regular point of f if the Jacobian of f at p is of maximal rank.

2) p ∈ M^m is a critical point of f if it is not a regular point, i.e. if the rank of the Jacobian of f at p
is less than maximal.

3) If p ∈ M^m is a critical point of f, then f(p) ∈ M^n is a critical value of f. Note this means that
q ∈ M^n is a critical value of f if f^{-1}(q) contains a critical point, even though it may be that
q = f(p') for some regular point p'. Also notice that mountain peak heights are critical values.

4) q ∈ M^n is a regular value of f if it is not a critical value. So q is a regular value if f^{-1}(q) contains
only regular points, or if q is not even in the range of f. That's because it's handy to have only 2
types of points in M^n: critical and regular.

5) f is an immersion at p if p is a regular point and dim M^m ≤ dim M^n.

6) f is a submersion at p if p is a regular point and dim M^m ≥ dim M^n.

7) If f is an immersion (submersion) at every p ∈ M^m, then it is simply called an immersion (submersion).

8) f is an embedding if it is an immersion and a homeomorphism onto its image. [A homeomorphism is
a mapping which is continuous and has a continuous inverse.]
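
For a monochrome picture f : R^2 → R^1 the Jacobian is just the gradient, so definitions 1) to 3) can be checked numerically. The sketch below is illustrative only: the helper names, the central-difference step, and the tolerance are all assumptions, and the sample "mountain" is a made-up picture function.

```python
def gradient(f, x, y, h=1e-5):
    """Central-difference estimate of the Jacobian (gradient) of f at (x, y)."""
    gx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return (gx, gy)


def is_critical_point(f, x, y, tol=1e-6):
    """A point is critical when the 1 x 2 Jacobian has rank 0,
    i.e. when both partial derivatives vanish (up to `tol`)."""
    gx, gy = gradient(f, x, y)
    return abs(gx) < tol and abs(gy) < tol


def mountain(x, y):
    """Hypothetical picture function with a single peak of height 1 at the origin."""
    return 1.0 - x * x - y * y
```

With this example, the origin is a critical point and the peak height `mountain(0, 0)` is a critical value, while a point on the slope such as (0.5, 0) is regular.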

There are numerous versions of the implicit function theorem, which go by various names, the most
common of which is the inverse function theorem. The above version is one of the most general. The
theorem is frequently stated only for the case m ≥ n, and the condition may be stated in terms of
regularity, rank or singularity (as a matrix or linear map) of the Jacobian or derivative, transversality of
f, etc. What all these essentially mean is that at the point in question, f only does as much collapsing as
is required to squeeze things into the dimension of the range, and no more. Notice that mountain peak
heights are critical values.


In our case, we are currently dealing with the situation of 1 color dimension, so we are interested in
F_1, F_2 : M^2 → R^1. Thus the theorem tells us that for a regular value y of F_1 (resp. F_2), F_1^{-1}(y) (resp.
F_2^{-1}(y)) is a 1-dimensional submanifold of K_1 ⊂ M^2 (resp. K_2). Note that for a function f : M^m → R^1 the
Jacobian is a 1 × m matrix, so a critical point p of f is one for which

    ∂f/∂x_1 (p) = ... = ∂f/∂x_m (p) = 0

Fig. (frag)


Here is the gist of what we will say in more precise terms.

1) Almost every level set of a picture is a circle or a line

2) These 1-manifolds account for almost all of the brightness values; the rest are extrema
or saddles (critical points).

3) Typically, pictures have isolated critical points (i.e. the critical points do not form
blobs, lines, or accumulations).

1) and 2) Almost all level sets and brightness values are regular

First we need to know that the contour lines have the simple structure above. To this
end we need the following version of the

Implicit function theorem (see e.g., [Bröcker and Lander 1975, Nitecki 1971,
Golubitsky and Guillemin 1973]). Let f : M^m → M^n be C^r. (Where M^k denotes
some k-dimensional manifold.) Then f^{-1}(y) ⊂ M^m is a C^r submanifold of dimension
max(m − n, 0) (or empty) if the Jacobian of f is of maximal rank (i.e. rank min(m, n))
at each x ∈ f^{-1}(y).

Note that f^{-1}(y) cannot be self-intersecting, since it is a submanifold.

differentiable 1-dimensional objects. That in turn is a consequence of the fact that the
picture is a map from a 2-dimensional object to a 1-dimensional object.

Some  differential  topology  for  vision 

In the proof of the first part of the theorem the facts we used from differential topology,
about gradient vector fields, flows, and contour lines, were elementary enough that we did
not have to go into much detail about the theory behind them. For higher dimensions, in
particular for more color dimensions, the situation is more difficult, and several advanced
ideas are prerequisite to proceeding with the rest of the proof. Also, the proof we just
gave for the monochrome case is somewhat technical, so we would like to illuminate the
intuitive ideas with some deeper results. The rest of this section, therefore, is devoted to a
review of some of the necessary ideas of differential topology, integrated with establishing
(for the first time in the vision literature) the aspects of vision to which they correspond.
We will use this theory in later sections as well; moreover, it is basic to the geometric
aspects of vision.

First, let's see how the proof given above fits into the intuitive scheme presented earlier
for using Fig. (frag). Then we will worry whether Fig. (frag) is a reasonable picture for
the contour lines of a picture function. The φ_ρ which we defined above, considered along
the dotted line, is essentially the rotation function we discussed earlier. We could use a
bump function to make it just what we want in some neighborhood of the dotted line,
and shear could be eliminated by using a ρ with negative as well as positive values. It
would take a little work, but it could be made to do the right thing.

Fig. (boundary)


By a standard construction (e.g. [Abraham and Marsden 1978]), there is a C^∞ function
K_1 → R which takes the value 0 outside of U_1 and the value 1 on U_2. Using this "bump"
function θ : K_1 → R, we get a vector field θ·Z on K_1 which vanishes outside of U_1, hence
its flow never leaves K_1, i.e. the flow φ_t of the vector field θ·Z has the property that
φ_t(p) is defined and lies in K_1 for all p ∈ K_1 and −∞ < t < ∞. Hence for any such t,
φ_t : K_1 → K_1 constitutes a diffeomorphism with contour lines invariant on K_1. In fact, it
is easy to see that this family of diffeomorphisms can be enlarged even more. Notice that
multiplying the vector field Z by a scalar C^r function ρ : K_1 → R does not alter orbits.
We can therefore enlarge the class of diffeomorphisms φ_t by taking all diffeomorphisms
φ_{t,ρ} given by the flows of ρ·θ·Z on K_1. Observe that for any constant a, φ_{at,ρ} = φ_{t,aρ},
so if ρ is a constant function, φ_{t,ρ} = φ_{ρt,1} = φ_{1,ρt}. Thus {φ_{t,ρ}} = {φ_{1,ρ}}, so by abuse
of notation we will write φ_ρ for φ_{1,ρ}. QED (n = 1).
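
One standard way to build such a bump function can be sketched in terms of the distance d from a point to the boundary of K_1. This is a sketch under assumptions, not the construction of the cited reference: the function names and the two radii `inner` and `outer` are hypothetical, with the bump equal to 0 within distance `inner` of the boundary (outside U_1) and equal to 1 beyond distance `outer` (on U_2).

```python
import math


def seed(t):
    """Smooth function that is 0 for t <= 0 and positive for t > 0."""
    return math.exp(-1.0 / t) if t > 0.0 else 0.0


def smooth_step(t):
    """C-infinity step: 0 for t <= 0, 1 for t >= 1, strictly between otherwise."""
    a, b = seed(t), seed(1.0 - t)
    return a / (a + b)  # a + b > 0 for every real t


def bump(d, inner, outer):
    """Bump value at a point whose distance to the boundary of K_1 is d.

    Assumes 0 <= inner < outer.
    """
    return smooth_step((d - inner) / (outer - inner))
```

Multiplying a vector field by `bump` leaves it unchanged deep inside the region and switches it off smoothly near the boundary, which is what keeps the flow inside K_1.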

We have thus far proved our result for the monochrome case. In summary, we have shown
that in matching 2 regions free of occlusions, i.e. when we have a matching diffeomorphism
between the regions, the match is far from unique. In fact there are essentially as many
matches as C^r functions from such a region to the reals. (One has to factor out functions ρ
which lead to equivalent time-one maps, e.g. those which rotate contour lines by multiples
of 2π.) This stems directly from the fact that the iso-brightness loci constitute connected

inner product on an orientable manifold). One might, e.g., define the new vector field
Z on K_1 by Z(p) = (−b, a) if ∇F_1(p) = (a, b). Note that Z · ∇F_1 = 0 at all p. Since
smoothness is defined with respect to coordinates, Z has the same degree of smoothness
as ∇F_1. Furthermore, wherever Z ≠ 0, it is tangent to the contour lines of F_1, so that
the orbits of Z are exactly those contour lines, and its critical points (zeros) are exactly
those of ∇F_1. We now want to consider the flow φ_t : K_1 → K_1 of the vector
field Z.
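
The claim that the orbits of Z are contour lines can be checked numerically. The sketch below is illustrative only: the picture function F is a made-up example, and the flow is approximated with a classical fourth-order Runge-Kutta integrator rather than computed exactly.

```python
def F(x, y):
    """Hypothetical monochrome picture function."""
    return x * x + 2.0 * y * y


def grad_F(x, y):
    """Gradient (a, b) of the example picture."""
    return (2.0 * x, 4.0 * y)


def Z(x, y):
    """Gradient rotated by +90 degrees: (a, b) -> (-b, a)."""
    a, b = grad_F(x, y)
    return (-b, a)


def flow(x, y, t, steps=1000):
    """Approximate the flow of Z for time t using RK4 steps."""
    h = t / steps
    for _ in range(steps):
        k1 = Z(x, y)
        k2 = Z(x + 0.5 * h * k1[0], y + 0.5 * h * k1[1])
        k3 = Z(x + 0.5 * h * k2[0], y + 0.5 * h * k2[1])
        k4 = Z(x + h * k3[0], y + h * k3[1])
        x += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        y += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return (x, y)
```

Starting from any point, the time-one map `flow(x, y, 1.0)` moves the point but leaves the value of `F` unchanged up to integration error, so each contour line slides along itself.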


The flow φ_t : M → M of a vector field on the space M is the solution to the initial value problem
defined by considering the vector field as a system of differential equations on M. I.e., φ_t is the unique
map such that φ_0(p) = p and d/dt φ_t(p) = v(φ_t(p)), where p ∈ M and v(q) is the vector at q. The flow
moves the space along the solution lines, which are always tangent to the vector field. Smoothness of the
vector field guarantees smoothness (and uniqueness) of the flow. The time-one map associated with a flow φ_t is the
diffeomorphism φ_1 : M → M; i.e. a snapshot of the flow at one particular instant of time. The orbit of
a point (or set) p under the flow is the set of all values of φ_t(p) for all t, i.e. −∞ < t < ∞. Notice
that the flow is a function of time as well as a map on the manifold; this is a slight abuse of the notation
we are using for functions.


But first we have to deal with a slight problem, viz. that near the boundary of K_1, the
time-one map may not be defined if a contour line has a boundary. To overcome this,
we use the following device to make Z vanish near the boundary of K_1. We find open
sets U_1, U_2 such that Ū_2 ⊂ U_1 ⊆ Ū_1 ⊆ K_1. U_2 can be almost as big as K_1, since we
can choose U_1 and U_2 as follows. Let Ū_1 be K_1 − V_ε, where V_ε is an ε-neighborhood of
the boundary of K_1, where by ε-neighborhood we mean the union of all open balls of
radius ε centered at points of the boundary of K_1. Then, of course, U_1 = interior Ū_1.
Similarly, U_1 can be slightly contracted to yield U_2. If we assume that the boundary of
K_1 is piecewise smooth (which follows, e.g., if the picture results from a finite number of
smooth objects) then the measure of U_2 can be made arbitrarily close to that of K_1.

Fig. (frag)


Observe first that if ψ : K_1 → K_1 is a diffeomorphism taking contours of F_1 to contours
of F_1, then g_π is a matching function ⇒ g_π ∘ ψ is a matching function. Define ψ as
follows. As you go along the dotted line

    γ : I → K_1,   I = [0, 1] ⊂ R
    t ↦ γ(t)

in Fig. (frag), slide each contour along itself by an amount θ(t). As long as θ : I → R is a
diffeomorphism onto its image, the map ψ will be a diffeomorphism in a neighborhood of
the dotted line. To the extent that this picture is valid, there will be as many matchings
g_π ∘ ψ as there are such maps θ.
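
The sliding construction is easiest to see in a special case chosen here purely for illustration: a radially symmetric picture whose contours are circles, with the radius playing the role of the parameter t along the dotted line. The names below are hypothetical.

```python
import math


def picture(x, y):
    """Hypothetical monochrome picture whose contour lines are circles."""
    return x * x + y * y


def twist(theta):
    """Return psi, which slides the contour of radius r along itself
    by the angle theta(r)."""

    def psi(x, y):
        r = math.hypot(x, y)
        angle = theta(r)
        c, s = math.cos(angle), math.sin(angle)
        return (c * x - s * y, s * x + c * y)

    return psi
```

Every choice of `theta` gives a map that leaves `picture` unchanged, so composing a matching function with any such `psi` yields another matching function; different choices of `theta` give different maps.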

Actually, we are going to use a slightly more general method to construct a family
of diffeomorphisms ψ_ρ roughly in 1-1 correspondence with the set of all C^r functions
K_1 → R. For this we will use a canonical vector field defined along the contour lines of
F_1, which will tell us how much to slide each contour line.

First we observe that the map F_1 : K_1 → R has a canonical vector field associated with it, the gradient
vector field ∇F_1, which assigns to each point p ∈ K_1 a vector ∇F_1(p). ∇F_1 is always orthogonal to the
contour lines of F_1 (with the usual inner product on K_1 inherited from R^2), and it is 0 precisely at the
critical points of F_1. In (2-dimensional) coordinates, ∇f = (∂f/∂x, ∂f/∂y). Clearly, if f ∈ C^r, then
∇f ∈ C^{r−1}.

Define a new vector field on K_1 by rotating each of the local vectors of ∇F_1 by +90°,
i.e. 90° counterclockwise (which is uniquely defined because we have a globally defined

If n ≥ 3 (i.e. the picture has at least 3 color dimensions), then generically there will be
a unique g_π which makes the diagram commute.

Once the problem is appropriately formulated, the proof yields to repeated attack by some
standard machinery of differential topology. (An excellent introduction to the subject is
[Guillemin and Pollack 1974], and [Hirsch 1976] is a good reference.)

Proof (case n = 1). For the time being, we only consider monochrome pictures (n = 1).
We will return later to the situation for pictures with more color dimensions.

The  idea  for  this  part  of  the  proof  is  fairly  simple;  the  difficulty  lies  in  establishing  when 
it  is  valid. 

The map F_1 : K_1 → R can be thought of as a topographic landscape on K_1 ⊂ R^2, where intensity is
represented by altitude. Consider K(x) = F_1^{-1}(x) for some intensity x ∈ R. K(x) is an iso-intensity
contour for the intensity x and corresponds to an elevation contour on a geographic topographic map.

The idea is this. Observe that if $F_2 \circ g_\pi(p) = F_1(p)$ (i.e. if Fig. (GM) is a commutative
diagram) then $g_\pi(F_1^{-1}(z)) = F_2^{-1}(z)$, i.e. $g_\pi$ takes contour lines to contour lines. (Proof:
Suppose $p \in F_1^{-1}(z)$ and $q = g_\pi(p)$. Since $F_2(q) = F_1(p)$ and $F_1(p) = z$, $q \in F_2^{-1}(z)$.
Thus $g_\pi(F_1^{-1}(z)) \subset F_2^{-1}(z)$. Similarly, since $g_\pi$ is a diffeomorphism,
$(g_\pi)^{-1}(F_2^{-1}(z)) \subset F_1^{-1}(z)$, whence $F_2^{-1}(z) \subset g_\pi(F_1^{-1}(z))$.) Conversely, any
diffeomorphism $h : K_1 \to K_2$ which takes each contour line of $F_1$ to the contour line of $F_2$
with the same intensity satisfies the conditions for $g_\pi$. (Proof: essentially immediate. We
want to show that $h(F_1^{-1}(z)) = F_2^{-1}(z) \Rightarrow F_2(h(p)) = F_1(p)$. Choose $p \in K_1$, and let
$z = F_1(p)$, so that $p \in F_1^{-1}(z)$. By hypothesis, $h(p) \in F_2^{-1}(z)$, so $F_2(h(p)) = z = F_1(p)$. QED)
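
The degeneracy this argument points to can be checked numerically on a toy picture. The sketch below is an illustration only and is not taken from the text: the intensity function $x^2 + y^2$, the function names, and the particular twist angle are all assumptions chosen for simplicity. Each point is rotated about the origin by an angle that depends only on its radius, so every circular contour is carried to itself and the intensity is unchanged, even though the map is not the identity.

```python
import math


def intensity(x, y):
    """Toy monochrome picture (an assumption): contours are circles about the origin."""
    return x * x + y * y


def twist(x, y, amplitude=0.5):
    """Rotate (x, y) about the origin by an angle depending only on the radius.

    The angle profile amplitude * exp(-r^2) is a hypothetical smooth choice;
    any smooth function of the radius preserves the contours equally well.
    """
    angle = amplitude * math.exp(-(x * x + y * y))
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def max_intensity_change(points, amplitude=0.5):
    """Largest |F(h(p)) - F(p)| over the sample points, with h = twist."""
    return max(abs(intensity(*twist(x, y, amplitude)) - intensity(x, y))
               for x, y in points)
```

Varying `amplitude` already gives a continuum of contour-preserving maps for this one picture, which is in the spirit of the infinitely many solutions claimed for the monochrome case.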

Thus far we have shown that any $g_\pi$ taking contour lines to contour lines will solve our
local matching problem. But how many such $g_\pi$’s can there be? Assume for the moment
that a typical contour map contains a diffeomorphic image of the fragment represented
by the solid lines below




confine our attention only to a diffeomorphism $g_\pi : K_1 \to K_2$ where $K_1, K_2$ are both
connected.

The 2-color theorem

Theorem (2-color theorem). Stereo requires at least 2 colors or 3 dimensions. I.e., for a
monochrome  picture,  general  matching  has  infinitely  many  solutions,  but  for  2  or  more 
color  dimensions,  it  is  generally  unique.  Hence  the  monochrome  case  requires  knowledge 
of  the  imaging  situation  to  constrain  the  problem. 

More  precisely,  consider  the  commutative  diagram 


where $g_\pi$ is a $C^1$ diffeomorphism, $F_1, F_2$ are $C^1$, and $K_1, K_2$ are compact.

If n = 1 (i.e. the picture is monochrome), then there exists an infinite-dimensional family of $C^1$
diffeomorphisms $\{h_v\}$ such that replacing $g_\pi$ by $h_v$ also results in a commutative diagram
(i.e. is a solution). The family $h_v$ is parametrised by (at least) the continuous functions
$K_1 \to \mathbb{R}$, and contains an isomorph of a neighborhood of the identity.

If n = 2 (i.e. the picture has 2 color dimensions), then generically there will be a finite
number of $g_\pi$ which make the diagram commute (note we have assumed that such a $g_\pi$
exists). If we take $K_1, K_2$ to be rectangles or discs (as in a usual picture) then generically
there is a unique $g_\pi$.

al. 1983]). The epipolar geometry consists of known foliations of $K_1$ and $K_2$ by curves,
with the image of each such curve under $g_\pi$ also known. This information is derived from
sources other than general matching; e.g. from singularity matching, or interactive (i.e.,
human) guidance, used along with assumptions about imaging geometry, such as optical
characteristics. Part of our purpose is to understand how this information provides a
constraint, and ultimately to see how much of it can be found in an integrated process.

We are not proposing here to use exact equality of point brightness values as a matching
criterion for stereo vision programs, nor to ignore imaging geometry. Rather, we are
investigating the consequences of the idea that there is some function describing surface
character, maybe not the data itself, which manifests itself in 2 different distorted
pictures. We want to know what it takes to find that distortion in principle. We see this as a
first step to understanding what it takes in practice, where there are further complicating
factors. We consider some of these in later sections.

The  following  question  then  arises: 

Problem (Uniqueness of General Matching). If we are looking for an arbitrary (piecewise)
$C^1$ diffeomorphism $g_\pi$ to make Fig. (GM) commute, under what conditions are we
guaranteed a unique solution to the matching problem?

E.g., if $F_1, F_2$ are both constant functions, i.e., we have uniformly gray pictures, the
problem is completely degenerate, and any diffeomorphism $g_\pi$ is a solution.

Since there may be occlusion, $K_1$ and $K_2$ may not be connected regions. We don’t
consider the (important) problem of determining the connected components of $K_1$ and
$K_2$, i.e. determining the occlusion-free regions. Suppose, instead, that some $g_\pi$ exists
fulfilling the above criteria. We will be concerned only with regions where $g_\pi$ is smooth,
i.e., a $C^1$ diffeomorphism. These are regions containing no occlusions or points where the
surface is tangent to the line of sight. Furthermore, we do not consider the problem of
determining which are the corresponding connected components of the 2 pictures. We

This means that every point in one region is matched to a corresponding one in the other,
in such a way that the 2 picture functions give identical brightness or color values on
corresponding points, while keeping the distortion continuous and differentiable, i.e.
maintaining the region topology. This automatically guarantees matching of context. Only
after the matching function is found is the surface embedding computed by associating
relative depth with relative disparity at each point of (say) $K_1$. In this approach the
matching proceeds without any knowledge of the 3-dimensional structure represented by
Fig. (*). Stereo matching programs rarely actually try to solve the problem in this pure
form, for a number of good reasons. In the first place, geometric information is usually
available, and some of it is often used to constrain the matching. In fact, as we will
show, this is necessary to achieve any success for unique point correspondence. Secondly,
programs do not usually use simple equality of brightness values as a matching criterion
(though see [Baker 1981] for a use of essentially that criterion as an interpolation method
for regions between known corresponding points). There are several reasons for this.
Various sources of noise, including digitisation as well as electronics, make it impractical
to look for exact values of brightness. There can be variations between 2 images, such as
camera settings or film properties, as well as photometric changes. In approaches which
use area matching (e.g. [Connery 1980]) one frequently uses some measure of similarity
of context as a matching criterion; one family of these is derived from cross-correlation.
Part of the art of these measures is to compensate for the imperfections we have just
mentioned. Nevertheless, there is generally the assumption that there is some underlying
function which transforms according to Fig. (GM); although this function may not be
identical with the data, it gives rise to it.

Frequently one assumes that the 2 images $F_1, F_2$ are rectified, i.e. that $g_\pi$ takes scan
lines to scan lines in a known way: $g_\pi(x, y) = (g(x, y), y)$ for some $g : K_1 \to \mathbb{R}$. This
very strong constraint on imaging geometry is rarely valid in practice. Instead, one can
rely on knowledge of epipolar geometry for an additional constraint (see, e.g., [Baker et

An Application: Stereo by General Matching

As an application of the abstract viewpoint we are proposing, we show that for
monochrome images, the general matching problem is insoluble. We exhibit the
degeneracy, and show that additional color dimensions allow unique solution.

The  problem 

A common goal of stereo matching is to solve the correspondence problem for some region,
i.e. to pair corresponding points between 2 pictures within some region. A pair of points
in 2 pictures correspond if they arise from a common single point in the scene. The
correspondence must be inferred from the picture functions. There have been many
approaches taken to do this, and geometric information as well as a point’s picture context
have been used in many ways to make the inference. One of our ultimate goals is to build
a theory which gives a coherent view of the problem and the methods which have been
used to attack it.

A basic need is to understand what the roles of geometry and context are in this problem:
how much can you tell just from picture distortion, and how much do you have to
know about the way the image was formed? To shed some light on this, we look at
the stereo problem as the general matching problem. I.e., given 2 picture functions
$F_1, F_2 : M^2 \to \mathbb{R}^n$, one finds regions $K_1, K_2 \subset M^2$ and a 1-1 matching function
$g_\pi : K_1 \to K_2$ such that the diagram Fig. (GM) commutes.

neighborhoods $K_i$. So, e.g., the assumption of rectified images could be stated as the
requirement that $g_\pi$ take horizontal straight lines to horizontal straight lines.


Motion  stereo 

Instead of a single $g \in E(3)$, we have a 1-parameter family $\{g_t\}$, given by

\[ \gamma : I \to E(3), \qquad t \mapsto g_t \]

such that $g_0 = 1$ (the identity in $E(3)$), where $I = [0, 1] \subset \mathbb{R}$. Given is a corresponding
family of pictures $F_t$.

It’s common to consider a sequence of pictures related by a sequence of transformations
$\{g_i,\ i = 0, 1, \ldots\}$, with the corresponding family of pictures $F_i$. This can be thought of
as a special case of the above, where the transformations are parametrized by a discrete
set:

\[ \gamma : \mathbb{Z}^+ \to E(3), \qquad i \mapsto g_i \]

Although this reflects the discrete character of what happens in practice, the former
(continuous) representation makes it easier to exploit the temporal smoothness properties
of the image.

Fig. (*) illustrates the situation for only 2 $g_t$’s.

5) $g_t$ is a translation for each $t$: the simplest case for epipolar lines

These constraints consist of focusing attention on subsets of $E(3)$ having particular
properties.


Area  matching  stereo 

To differentiate from feature-based stereo, we define area matching stereo by requiring
that the stereo problem be solved for a full-dimensional part of the surface, i.e. a
neighborhood. This bears an implicit assumption that area-supported functions $F_1, F_2$
are used directly, and that some intrinsic area-supported function $F$ can be found. An
example is matching of areas based on the cross-correlation function between the 2 picture
functions on those areas. Feature-based stereo, by contrast, depends on lower-dimensional
objects, such as edges or critical points (precisely defined later).
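
As a concrete instance of a cross-correlation measure between two areas, the following minimal sketch computes a normalized cross-correlation of two patches given as flat lists of brightness values. The normalization (subtracting the mean and dividing by the energy) is an assumption made here, chosen because it is insensitive to bias and gain; the text does not specify which correlation variant is meant, and the function name is hypothetical.

```python
import math


def normalized_cross_correlation(patch_a, patch_b):
    """Normalized cross-correlation of two equal-length lists of brightness values.

    The result lies in [-1, 1] up to rounding; 0.0 is returned if either
    patch is constant, since the measure is undefined there.
    """
    if len(patch_a) != len(patch_b) or not patch_a:
        raise ValueError("patches must be non-empty and of equal length")
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    dev_a = [v - mean_a for v in patch_a]
    dev_b = [v - mean_b for v in patch_b]
    energy = sum(v * v for v in dev_a) * sum(v * v for v in dev_b)
    if energy == 0:
        return 0.0
    return sum(a * b for a, b in zip(dev_a, dev_b)) / math.sqrt(energy)
```

A patch and a copy of it with changed gain and bias score approximately 1, which is the kind of compensation for camera differences that area measures are meant to provide.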


General  matching 


This  we  define  to  be  area  matching  stereo  without  any  knowledge  of  imaging,  described 
by  the  diagram: 


Fig. (GM)


Here we are only given $F_1, F_2$ and the problem is to find $K_1, K_2, g_\pi$ such that the
diagram commutes. There may be constraints on $g_\pi$ equivalent to those for stereo, except
the constraints can only, of course, be stated in terms of the diffeomorphisms between

points. If we’re talking about a region contained in a compact set, that implies a finite
number of nondegenerate critical points and a minimum spacing between them. Similarly,
there are nice functions which have all their critical points nondegenerate; these are
known as Morse functions (after Marston Morse). And finally, almost all $C^r$ functions
are Morse functions (we’ll have to specify what we mean by “almost all”), so that we have
a justification for acting as if all our critical points are nondegenerate. Now, here’s what
all this means in terms of our illustration of level sets of the intensity map (Fig. (frag)).
Choose a picture at random. (Say the picture is bounded by a rectangle $R$ with interior
$V$.) If it has no critical points, then all the level sets are diffeomorphic to (disjoint unions
of) line segments (and not circles, which are the only other possibility by Milnor’s result,
cited above). (Proof: By contradiction. Suppose $f^{-1}(a) \subset V$ is diffeomorphic to a circle,
so that it bounds a disk in $V$. The closed disk is compact, so $f$ must take a maximum
and minimum on it. If one of these is not on the boundary, $f^{-1}(a)$, it is a critical point
of $f$. If both are on the boundary, then the entire disk consists of critical points, since
the maximum and minimum are both $a$. QED.) Suppose the picture does have critical
points. Then “generically” the critical points are isolated.

First let’s see what happens near such a critical point. By Morse’s lemma (stated
completely below) we know that there is a coordinate system $(u, v)$ in a neighborhood of
the critical point $p$ such that $f = f(p) \pm u^2 \pm v^2$. The possible signs correspond to a
maximum ($--$), a minimum ($++$), and a saddle ($+-$). So for an extremum, it’s easy to
see that the level sets are just a point surrounded by circles.

one of the critical points is a maximum and the other a minimum. But from the Morse
inequalities, there must be a saddle somewhere, too. In fact, the larger picture looks like

Fig. (dimple)

When there are two maxima (or minima, in Australia), the picture is
Definition. $C^r_S(M, N)$ is the space of all $r$ times continuously differentiable functions
$M \to N$, with the so-called strong topology. We omit the definition of this topology, but
only mention that it is based on the closeness of all derivatives from the 0th (i.e. the
value of the function itself) through the $r$th.

Definition. A Morse function is a function $f : M \to \mathbb{R}$ which has only nondegenerate
critical points.

Proposition. Nondegenerate critical points are isolated. I.e., a nondegenerate critical
point has a neighborhood in which there are no other critical points.

Definition. A nondegenerate critical point is one where the Hessian matrix is nonsingular.
This basically means that the graph of the function is not flat at the critical
point.

Definition. The Hessian matrix of a function $g : \mathbb{R}^n \to \mathbb{R}$ at a point $p$ is the matrix

\[ \left( \frac{\partial^2 g}{\partial x_i \, \partial x_j}(p) \right) \]


Theorem (Morse’s Lemma). Let $p \in M^n$ be a nondegenerate critical point of index $k$ of
a $C^{r+2}$ map $f : M^n \to \mathbb{R}$, with $1 \le r \le \omega$. Then there is a $C^r$ chart $(\varphi, U)$ at $p$ such
that

\[ f \circ \varphi^{-1}(u_1, \ldots, u_n) = f(p) - \sum_{i=1}^{k} u_i^2 + \sum_{i=k+1}^{n} u_i^2 \]

Definition. $p \in M^n$ is a nondegenerate critical point of index $k$ of a map $f : M^n \to \mathbb{R}$
if the Hessian of $f$ at $p$ has $k$ negative eigenvalues (counting multiplicities).
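
For a picture (two dimensions) the index can be read off directly from the three second derivatives at the critical point. The sketch below is an illustration under stated assumptions: the function names and the tolerance `tol` used to decide when an eigenvalue counts as zero are hypothetical, and the eigenvalues come from the standard closed form for a symmetric 2×2 matrix.

```python
import math


def hessian_eigenvalues_2x2(fxx, fxy, fyy):
    """Eigenvalues (low, high) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2.0
    radius = math.hypot((fxx - fyy) / 2.0, fxy)
    return (mean - radius, mean + radius)


def morse_index(fxx, fxy, fyy, tol=1e-12):
    """Number of negative Hessian eigenvalues, or None if the point is degenerate."""
    eigenvalues = hessian_eigenvalues_2x2(fxx, fxy, fyy)
    if any(abs(e) <= tol for e in eigenvalues):
        return None  # singular Hessian: not a nondegenerate critical point
    return sum(1 for e in eigenvalues if e < 0)


def classify_critical_point(fxx, fxy, fyy, tol=1e-12):
    """Name the critical point by its index: 0 minimum, 1 saddle, 2 maximum."""
    index = morse_index(fxx, fxy, fyy, tol)
    if index is None:
        return "degenerate"
    return {0: "minimum", 1: "saddle", 2: "maximum"}[index]
```

The three nondegenerate outcomes correspond to the sign patterns (++), (+−) and (−−) in the local normal form given by Morse’s Lemma.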

Theorem (Corollary of Morse inequalities and Theorem of Hopf). Let $f : M^n \to \mathbb{R}$ be a
Morse function on a compact manifold without boundary, with $\nu_k$ critical points of index
$k$, $0 \le k \le n$. Then

\[ \sum_{k=0}^{n} (-1)^k \nu_k = \chi(M^n) \]

where $\chi(M^n)$ is the Euler characteristic of $M^n$.
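
The alternating sum is easy to evaluate for small cases. The following sketch is illustrative only; the function name is hypothetical, and the example counts (height functions on a sphere and on a torus) are standard textbook cases assumed here rather than taken from the text.

```python
def euler_characteristic_from_critical_points(counts):
    """Alternating sum of critical point counts.

    counts[k] is the number of critical points of index k, for k = 0..n.
    """
    return sum((-1) ** k * nu for k, nu in enumerate(counts))
```

For a surface, `counts` is `[minima, saddles, maxima]`. On a sphere (Euler characteristic 2), one minimum and two maxima force exactly one saddle, since 1 − 1 + 2 = 2; this is the counting behind the remark that a saddle must appear somewhere.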

Open  dense,  usually,  generically,  almost  all,  typically 

The key result of the theorem above is that the Morse functions are open dense. This
allows us to restrict our attention only to pictures whose critical points are isolated and
thus to avoid considering pathological behavior.

Here’s why. Instead of discussing only pictures and Morse functions, we will talk about
dense open subsets of $C^r(M, N)$ generally, since the scope then includes things like general
position as well as other properties, with no extra difficulty. Suppose some open dense
set consists of functions which all have some nice property. (We will call such a property
generic. Often generic is defined with respect to a countable intersection of open dense
sets, but for us open dense is enough.) Then, as a consequence of density, any function
in $C^r(M, N)$ is arbitrarily close to a nice one, hence can be arbitrarily well approximated
(with respect to all $r$ derivatives) by a nice one. Of course, we need more than density
to be justified in saying “most.” Dense sets can have measure 0. For example, both the
rationals and irrationals are dense in $\mathbb{R}$, yet we don’t want to say that most numbers are
rational. Requiring that the set be open dense solves this problem (although note that
the irrationals aren’t open either, though they are a countable intersection of open dense
sets (viz. $\bigcap_{q \in \mathbb{Q}} (\mathbb{R} - \{q\})$, where $\mathbb{Q}$ is the rational numbers)).

Actually, it does much more. For one thing, it allows us to completely neglect any
functions which aren’t nice: Suppose we decide that functions having a nice property are
open dense. Then we decide that the same is true of some other nice property. We’d
like to have both properties, of course, which cannot be guaranteed on the basis of only
density. But the intersection of a finite number of open dense sets is open dense. We
don’t use probability because there is no natural measure for $C^r(M, N)$, and no natural
probability distribution. We would like to say that in a measure space of total measure 1,
an open dense subset is also of measure 1, but strangely enough, even though open dense
sets are very dense indeed, this does not have to be so. For example, one can remove
a Cantor set of positive measure from the unit interval, leaving an open dense set of
measure less than 1.
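
One standard way to obtain such a set is the Smith-Volterra-Cantor construction, which is assumed here for illustration and is not named in the text: at stage n, remove the open middle interval of length $4^{-n}$ from each of the $2^{n-1}$ remaining pieces. The sketch below, with a hypothetical function name, sums the removed lengths exactly. The removed set is open and dense in the unit interval, yet its total length tends to 1/2, so the Cantor set left behind has measure 1/2.

```python
from fractions import Fraction


def removed_measure(steps):
    """Total length removed from [0, 1] after `steps` stages of the construction.

    Stage n (n = 1, 2, ...) removes 2**(n - 1) open middle intervals,
    each of length 4**(-n).
    """
    return sum(Fraction(2 ** (n - 1), 4 ** n) for n in range(1, steps + 1))
```

The partial sums equal $1/2 - 2^{-(n+1)}$, so they increase toward 1/2 without ever reaching it.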

Also genericity is related to stability. There are numerous definitions of stability; we
are concerned with structural stability. A function $f \in C^r(M, N)$ is structurally stable
with respect to some equivalence relation (e.g. topological equivalence) if all sufficiently
small perturbations of $f$ (relative to the $C^r(M, N)$ topology) result in an equivalent
function. In other words, $f$ is not a freak, destroyed by the least perturbation. With
respect to the equivalence determined by possessing the generic property, the openness
of generic sets makes a function with a generic property structurally stable. In other
words, small perturbations of $f$ do not affect the presence of the generic property. And
with respect to some other equivalence, the density guarantees that a structurally stable
function will be equivalent to a generic one. Usually, structural stability is defined
with respect to some topological equivalence relation. E.g., we can define 2 functions
$f, g : M \to N$ to be topologically equivalent if there is a homeomorphism $h : M \to M$
such that $f = gh$. This is the situation of Fig. (GM). Notice that topological equivalence
guarantees that topological properties will be shared, e.g. $h$ takes level sets of $f$ to
those of $g$, so the structural stability (with respect to this topological equivalence) of
$f$ would guarantee that the level set structure topologically remains unchanged under
small perturbations. Now the perturbation could also be derived from motions of the
observer, and we might be interested in some other feature, e.g. a derived boundary.
For the purpose of analysing the picture, we would probably want to focus attention on
boundaries whose topological structure didn’t change with small changes in viewpoint,
or in the picture (e.g. noise), hence we would want to focus attention on generic pictures
and structurally stable features. Of course, we are interested in more than just topology
in analysing a picture, so that is not all we would have to consider, but it is a first cut
at separating wheat from chaff.

Multiple color dimensions: the cases $n \ge 2$

We now carry on with the proof of the 2-color theorem for higher dimensions.

Let $f : M^m \to M^n$ be $C^r$ and regular at $p$. The analysis is based on the fact that
at a regular point, if there is enough room in the range space, $f$ is a diffeomorphism
from a neighborhood $U$ of $p$ to $f(U)$. This is yet another version of the implicit function
theorem. The idea of enough room can be made precise simply by requiring the Jacobian
to be 1-1. This is the case for a regular point if the dimension of the range space is at
least that of the domain space, i.e. if $m \le n$, which is the situation for us if there are at
least 2 color dimensions.

As before, the possible maps $g_\pi$ which solve the matching problem are exactly those which
take level sets to level sets. Since the $g_\pi$ are diffeomorphisms, we can just study the maps
of the level sets of, say, $F_1$, since they are equivalent by a given $g_\pi$ to the set of all $g_\pi$.
(To see this, consider Fig. (equiv). Let $h$ be a diffeomorphism which takes level sets to
level sets, i.e. which makes the diagram commutative, and define $g'_\pi = g_\pi \circ h$, so that
any $h$ gives us a $g'_\pi$. Likewise given such a $g'_\pi$, define $h = g_\pi^{-1} \circ g'_\pi$.)
So, without any loss of generality, we can restrict our attention to the part of Fig. (equiv)
shown in Fig. (equiv'), where we have replaced the notation $K_1$ by $M^2$ to help keep in
mind that we are considering a 2-dimensional region:


Fig. (equiv')

We pointed out earlier that any $h$ which satisfies our conditions must take level sets to level sets. If
$F_1$ is 1-1 for some point $q \in \mathbb{R}^n$, then the level set for that point is just a single point, and there is
no choice in what $h$ can do: following the lefthand $F_1$ arrow backwards, and likewise the righthand one,
we see that $h$ must take the single point $p = F_1^{-1}(q)$ to itself and no other. So the question of the
uniqueness of $h$ becomes one of studying how $F_1$ can fail to be 1-1.

First, let’s look at how many points can be in $F_1^{-1}(p)$. By the implicit function theorem,
since the dimension of the range (i.e. the color space) is at least that of the domain, the
level set of a regular value is at most a discrete set of points. Since we are restricting
ourselves to compact pictures, the level set must be a finite set (to avoid an accumulation
point). Hence on a level set, $g_\pi$ is constrained to be one of a finite number of permutations
of the finite level set. Furthermore, since $F_1$ is a local diffeomorphism at a regular value,
the permutation cannot jump around wildly among neighboring points, so that in fact $g_\pi$
is a permutation of “sheets.” I.e., let $p$ be a regular value of $F_1$. Then $F_1^{-1}(p) = \{q_i\}$, $i \in \mathbb{Z}$.
And there is some neighborhood $U$ of $p$ such that $F_1^{-1}(U) = \bigcup_i V_i$, $q_i \in V_i$, and the $V_i$ are
disjoint. The $V_i$ are then said to belong to different sheets, and the effect of $g_\pi$ is to
permute the $V_i$. It may happen that there is a path of regular points joining $q_j$ to $q_k$,
so that there is no global sheet as a set of points, although one can make an arbitrary
partition (as is the case for the familiar integral power function in the complex plane).
The sheets may be separated (and conceivably punctured) by critical points. Thus we
are led to consider the topology of the critical sets, and the cardinality of the level sets.
Fortunately, others, notably Thom, Boardman, Mather, and Whitney were led to consider
the same questions, beginning in the 1950’s, and we now proceed to use some of their
results. As it turns out, the higher dimensions are easier to deal with in our context, so
we will start with them.

Regular points when $n \ge 3$

We are interested in studying how $F_1$ can fail to be 1-1. We know from the implicit
function theorem that because $n > m$, $F_1$ is locally 1-1 at regular points. In other words,
$F_1$ is a local embedding of its regular set into $\mathbb{R}^n$. But it may not be a global embedding,
since the image may be self-intersecting. It is precisely at these self-intersection points
that $F_1$ fails to be 1-1 on the regular set. For a regular $p$, $F_1^{-1}(p)$ consists of isolated
points, so we can consider intersections of regular neighborhoods. What do these look
like?

Theorem (see e.g. [Hirsch 1976]). Let $M, N$ be embedded submanifolds of $\mathbb{R}^n$. Then
generically, $\dim M + \dim N - n = \dim M \cap N$, where a negative dimension means the
intersection is empty.

We are interested in the case where $M$ and $N$ are the images in $\mathbb{R}^n$ of regular
neighborhoods in $M^m$, so $\dim M = \dim N = 2$. From the above theorem we see that the
intersection is generically of dimension 2, 1, 0, and empty for n = 2, 3, 4, 5 resp. Thus
if $n \ge 3$, the intersection set is generically of lower dimension in the embedded regular
sets. $h$ must be the identity other than on the intersection (since elsewhere $F_1$ is 1-1),
and since removing a lower-dimensional subset leaves a dense set, $h$ is generically the
identity on a dense subset of the embedded regular sets. The continuity of $h$ then guarantees
unique continuation to the intersection, and there is again no choice in the behavior of $h$: it
must be the identity. So for the regular points, we have disposed of all the cases of 3 or more
color dimensions. Now we look at the critical sets, and their dimension.
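
The dimension count from the theorem can be tabulated directly. This is a minimal sketch with a hypothetical function name; it only evaluates the formula $\dim M + \dim N - n$ and reports an empty generic intersection as `None`.

```python
def generic_intersection_dimension(dim_m, dim_n, ambient_dim):
    """Generic dimension of the intersection of two submanifolds of R^ambient_dim.

    Returns None when the formula is negative, i.e. when the
    intersection is generically empty.
    """
    dimension = dim_m + dim_n - ambient_dim
    return dimension if dimension >= 0 else None
```

For two 2-dimensional sheets this reproduces the values quoted above for n = 2, 3, 4, 5.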


The  genericity  of  Morse  functions  can  be  generalised  as  follows. 


Theorem (Critical set dimension). For an open dense subset of $C^\infty_S(M^m, M^n)$, the set
of points of $M^m$ where the Jacobian of $f$ is of rank $r$

1) comprises a submanifold of $M^m$

2) $= \emptyset$ if $(m - r)(n - r) > m$

3) is of codimension $(m - r)(n - r)$ in $M^m$ if $(m - r)(n - r) \le m$

($X$ is of codimension $k$ in $Y$ if $\dim X + k = \dim Y$.)

Before  we  get  involved  in  studying  the  critical  sets  for  various  color  dimensions,  we  state 
2  more  closely  related  theorems  which  allow  us  to  immediately  understand  the  situations 
for  4  or  more  color  dimensions.  An  immediate  consequence  of  the  critical  set  dimension 
theorem  is  the 

Theorem (Whitney Immersion Theorem). If $X, Y$ are smooth manifolds, with
$\dim Y \ge 2 \cdot \dim X$, then maps with no critical points are open dense in $C^\infty(X, Y)$.

For a picture, $\dim X = 2$, so the above theorem applies when there are at least 4 color
dimensions. In that case, it states that the typical picture won’t have any critical points
at all. Hence, typically there is only one “sheet” and no folds.

A  further  result  is  the 

Theorem (Whitney 1-1 Immersion Theorem). If $X, Y$ are smooth manifolds, with
$\dim Y \ge 2 \cdot \dim X + 1$, then 1-1 maps with no critical points are residual (i.e. generic)
in $C^\infty(X, Y)$.

So  with  at  least  5  color  dimensions,  we  can  assume  no  color  is  used  twice. 

Returning to the critical set dimension theorem, in our case, m = 2, so what the
theorem tells us is that the dimension of the critical set is respectively 1, 0, and empty
for n = 2, 3, 4.
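
These values follow from the codimension formula $(m - r)(n - r)$ with m = 2 and Jacobian rank r = 1, the only rank drop that can occur generically in a picture. The sketch below evaluates the formula; the function name is hypothetical, and an empty set is reported as `None`.

```python
def generic_rank_set_dimension(m, n, r):
    """Generic dimension of the set where a map M^m -> M^n has Jacobian rank r.

    Uses codimension (m - r) * (n - r); returns None when that
    exceeds m, in which case the set is generically empty.
    """
    codimension = (m - r) * (n - r)
    if codimension > m:
        return None
    return m - codimension
```

Rank 0 gives codimension 2n, which exceeds 2 for every $n \ge 2$, so generically the Jacobian of a picture never vanishes entirely.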
By reasoning as we did for multiple points of the regular set, $h$, the diffeomorphism of
Fig. (equiv') which leaves the picture invariant, has unique continuation to the critical
set for $n \ge 3$, yielding the conclusion that $h$ is generically unique when $n \ge 3$ (for n = 2
the 1-1 set need not be dense, so the conclusion wouldn’t follow).

To summarise, we have thus far shown that $h$ must be the identity for $n \ge 3$, and is at
worst one of a discrete set of sheet permutations for n = 2. Now we will pursue the case
n = 2 a bit further.

If we allow the support of a picture to be all of $\mathbb{R}^2$ or $S^2$, that is all we can say. (Consider,
e.g., the function $z \mapsto z^k$ (for some $k \ge 2$) on the complex plane for the picture function.
Then the sheets can be permuted leaving the picture invariant.) But a real picture must
be finite in extent, so if we are considering subsets of the plane, a rectangle (i.e. a disc)
is an appropriate domain to consider. If we are thinking about the sphere, then since
we are restricting ourselves to occlusion-free regions, using the entire sphere would imply
that there were no observable occlusions, which could only happen in the improbable
events that only one object was illuminated, or that the observer could only see an object
which completely enclosed him. Right now we are only concerned with the genericity of
mappings of the plane, since we are in the context of general matching, so we will make
no claims regarding the genericity of occlusion or illumination, though such an analysis
is possible.
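
The sheet permutation for the example $z \mapsto z^k$ can be made concrete: multiplying $z$ by a $k$-th root of unity moves each point to the corresponding point on another sheet without changing the value of $z^k$. The sketch below is illustrative, with hypothetical function names and k = 3 as an arbitrary default.

```python
import cmath


def picture(z, k=3):
    """Two-color picture function on the complex plane: z -> z**k."""
    return z ** k


def permute_sheets(z, k=3, steps=1):
    """Multiply z by exp(2*pi*i*steps/k), a k-th root of unity."""
    return z * cmath.exp(2j * cmath.pi * steps / k)
```

Applying `permute_sheets` k times in succession returns every point to its starting position (up to rounding), so these maps form a finite group of picture-preserving transformations.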

Let us now assume that the picture support we are considering is topologically a disc.
Then $h$, being a homeomorphism, must map the boundary of the disc (a circle $S$) to
itself. If $f$ is 1-1 then $h$ must be the identity. If not, then consider what must happen
on this circle. $h$ must be continuable along $S$, so for $p \in S$, $f^{-1}(f(p))$ must contain a
constant number of points. This excludes the possibility of transverse crossings of $f(S)$.
But transverse crossings for such a map are generic, so $h$ must therefore generically be
the identity. QED
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>t  does  the  2-Cojpr  Theorem  really  mean? 

It might seem that we have shown that monochrome stereoscopic vision is impossible. But
many people with normal binocular vision have experienced stereoscopy with monochrome
pictures such as aerial surveys, stick figures of molecules, random-dot stereograms [Julesz,
Julesz 1971], etc. What is more, there is evidence that color is not important
in human stereopsis [Gregory 1977]. Machine stereo systems have been confined to
monochrome pictures, and though they have not approached human performance, they
have been successful in extracting usable depth information.

What led to the 2-Color Theorem was the observation that a change in viewpoint leads
to a complicated distortion of an object’s picture. This distortion depends on surface
shape, viewpoint change, and imaging geometry and optics. The problem was to deduce
the distortion from the data, i.e. to solve the correspondence problem. What we studied
was the degree to which this problem can be solved purely from the topology, without
considering the extra complexities of many possible geometric constraints. We confined
our attention to open sets free of singularities, i.e. areas without occlusions, which is of
course only a part of the stereo vision problem.

We were able to show that generically the monochrome problem is highly degenerate,
and we characterized the degeneracy. For the color problem, however, it turned out that
purely topological considerations were enough to (generically) solve the problem, and that
the geometric information was therefore redundant. The conclusion is that there is a big
difference between monochrome and color stereo; monochrome stereo requires and must
carefully incorporate geometric constraints to succeed in matching, while color stereo is
possible without this, and can therefore use the imaging geometry in a different strategy.

We have considered generic properties, so there can be infinitely many exceptions. But
since our results are for a generic subset of functions, they remain valid for small
perturbations, and since generic sets are dense, every function can be approximated by a
generic one to arbitrary precision. The results are about degeneracy, and the exceptions
are invariably more degenerate. (This is simply because the exception sets are the inverse
images of closed sets, e.g. places where a determinant is 0.) This means there are no
special cases of monochrome pictures that are less degenerate. But it is possible to find
more degenerate cases, so of course color pictures do not have to be uniquely matchable
in special cases, the simplest of which is a region of constant color. Actual data contains
noise and nonidealities, and digitization introduces degeneracy, so of course even with
color one cannot expect perfect matching in a real world program, and surely one would
still want to use constraints of imaging geometry to help the solution.

On the other side of the coin, a generic monochrome picture has isolated critical points,
and a finite number of them for a bounded region. Since critical points must match
critical points, finding this match is a combinatorial problem, which is made easier since
the critical points have other attributes which are invariant, as we have discussed in
the earlier section on differential topology for vision, and as we will discuss later in the
section on topological invariants of the picture function. One can say essentially the same
thing about level sets, mutatis mutandis. For a stereo pair of stick figures, then, most of
the matching is between singular points associated with places such as branchings and
terminations. The sticks themselves are individual level sets. Along those level sets, any
stereopsis must come either from special knowledge of imaging geometry, i.e. the epipolar
geometry relating the 2 retinas for a given state of convergence, interocular distance,
focal length, eye rotation, retinal position, and focus, or additionally from gestaltist
assumptions made by the visual system, such as an assumption of maximal simplicity.
E.g., 2 horizontal black lines of different lengths presented one to each eye give no relative
depth information about the region between endpoints, aside from continuity. However,
this stimulus invariably gives a sensation of a straight line receding in space at some fixed
angle. Similarly, a random-dot stereogram is assumed to be rectified, so the geometric
constraint is known and easily used. There are a finite number of dots in each line, so the
matching is combinatorial, and there is no attempt to match the area within individual
dots, which is completely degenerate. The degeneracy is ignored through an assumption
of simplicity in interpolation, i.e. it is assumed nothing new is happening within the dots,
while it is only true that nothing knowable is happening. More generally, we are concerned
with areas of maximal dimension taking values in a space of maximal dimension, in other
words, in 2-dimensional patches that have a range of brightnesses or colors. The results
depend almost entirely on these dimensions, so if we consider situations involving different
dimensions, we must expect different results.

The lessons for machine stereopsis are, as we said above, that monochrome stereo must
pay careful attention to correctly using geometric constraints, while the topology is
available to find a match between level set structures of some appropriate measurement
of surface characteristic. On the other hand, color may offer a way to avoid this problem.
There is a large literature on machine stereo vision, and we will not attempt a review here.
Representative works are [Arnold, R.D. 1983], [Baker 1981], [Baker et al. 1983],
[Barnard and Fischler 1982], [Gennery 1977, Gennery 1980], [Grimson 1980], [Hannah],
[Ohta and Kanade 1983], [Marr 1982], [Marr and Poggio 1976, Marr and Poggio 1977],
[Moravec 1977, Moravec 1980], [Nevatia 1970], [Panton 1978], [Quam 1971]. For
the most part, the effects of geometry are not carefully considered; usually it is assumed
that images are rectified, and no account is taken of possible distortion in the support
of operators used in the matching process. Using roughly vertical edges, i.e. places of
high horizontal gradient but small vertical gradient, renders some immunity, since these
are minimally changed under the distortions of typical imaging situations. [Arnold, R.D.
1983] studied how the distribution of edge angles is related to geometry. [Baker 1981] did
scanline interpolation, but assumed rectified images; this is improved in [Baker et al.
1983], where epipolar geometry is explicitly considered. The epipolar geometry, however,
is determined by a previous process of camera solving involving known interest point
correspondences. This permits epipolar line correspondence, but no correction is made
for distortion of operator supports. [Panton 1978] made some use of epipolar constraints,
and shaped the window, but doing this involved already having estimates of depth and
surface shape. [Grimson 1980], following in the footsteps of [Marr and Poggio 1977],
assumed rectified images, but found that this was not a reliable assumption and resorted
to a vertical search to compensate for geometric factors. [Gennery 1977, Gennery 1980]
was able to deal with imaging geometries, but did little about operator support distortion.

How does the constraint provided by epipolar geometry fit into the theory we have been
developing? The epipolar constraint is quite analogous to the general matching constraint:
1-dimensional objects must be matched to corresponding 1-dimensional objects (we are
confining ourselves now to monochrome pictures). For general matching the 1-dimensional
objects are level sets. For epipolar matching, they are the epipolar lines. We do not study
this in detail, but in the generic situation, one would expect these 2 families of curves
to intersect each other transversely, and therefore give a discrete set of solutions for
each point to be matched. It remains, however, to study what the degeneracies of this
situation are. This is quite independent of the basic problem of determining the epipolar
geometry. Generally this must be done by some combination of knowing the imaging
parameters and solving a correspondence problem. Machine systems have relied heavily
on the latter, so the problem is more subtle than may appear at first.
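
The intersection of level sets with epipolar lines can be illustrated on sampled data. The sketch below is ours and rests on simplifying assumptions: the pair is taken to be rectified (so the epipolar line of a pixel is its own row), intensities are intrinsic, and the profile along a row is treated as piecewise linear. The names `level_crossings` and `epipolar_candidates` are hypothetical.

```python
def level_crossings(row, value):
    """Sub-pixel positions where a sampled 1-D profile takes `value`.

    The profile is treated as piecewise linear between samples, so each
    transverse crossing of the level `value` yields one isolated position.
    """
    hits = []
    for i in range(len(row) - 1):
        a, b = row[i] - value, row[i + 1] - value
        if a == 0:
            hits.append(float(i))          # sample lies exactly on the level
        elif a * b < 0:                    # sign change: the level set crosses here
            hits.append(i + a / (a - b))   # linear interpolation of the crossing
    if row and row[-1] == value:
        hits.append(float(len(row) - 1))
    return hits

def epipolar_candidates(left, right, r, c):
    """Candidate matches in `right` for pixel (r, c) of `left`.

    Assumes a rectified pair, so the epipolar line of (r, c) is row r
    of the other image.
    """
    return level_crossings(right[r], left[r][c])
```

For a right-image row [0, 2, 4, 2, 0] and a left pixel of value 3, the candidates are the two isolated positions 1.5 and 2.5: a discrete set, but not a unique match. A constant row would be one of the degenerate cases that remain to be studied.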

When  is  this  analysis  useful? 

The F_1 and F_2 of Fig. (*') and the F_1 and F_2 of Fig. (CM), i.e. the “picture” functions
we have considered in the general matching problem are assumed to be intrinsic to the
object that is imaged. In practice, the absolute intensity levels or colors which one has
available in a set of images are not completely precise, reliable, or consistent. E.g., they
are likely to differ in bias (reference 0 level) and gain (measuring scale), suffer the effects
of change in viewpoint and lighting on image irradiance, and contain digitization noise.
Such considerations have discouraged people from using programs that try to match raw
measured intensity values directly, and instead have led to the use of derived values which
are felt to be more stable.

Our  results  are  not  just  statements  about  so-called  intensity  matching.  The  above 
theorems  about  matching  are  statements  about  intrinsic  surface  characteristics  associated 
with  points  in  images.  They  remain  true  even  if  the  images  themselves  are  not  directly 
matchable;  i.e.  if  our  goal  in  matching  is  to  match  points  that  have  the  same  value 
of  an  intrinsic  function,  then  our  theorems  will  govern  the  uniqueness  of  the  match, 
regardless  of  how  the  actual  images  must  be  manipulated,  or  how  they  came  about.  If 
a  derived  function,  truly  intrinsic  to  the  object,  is  to  be  matched,  our  results  are  just  as 
applicable,  providing,  of  course,  that  the  new  function  is  not  computed  in  some  degenerate 
way, destroying genericity (which would lead to even greater degeneracy in the solution).
E.g., the digitization process cannot decrease the ambiguity, since it is a projection to
a  lower  dimensional  space.  Since  this  mapping  cannot  be  1-1,  it  is  unavoidable  that  the 
ambiguity  will  be  increased,  unless  very  special  conditions  occur.  We  have  not  studied 
the  degradation  imposed  by  digitization  systematically  here. 
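
A toy illustration of the last point, under assumptions of our own (a 1-D profile, floor quantization, and "ambiguity" measured as the largest number of samples sharing one value); it is a sketch, not a systematic study:

```python
from collections import Counter

def quantize(values, step):
    """Project real-valued samples onto a grid of spacing `step` (floor rounding)."""
    return [int(v // step) for v in values]

def max_ambiguity(values):
    """Largest number of samples sharing one value, i.e. the worst-case
    number of candidates for a purely value-based match."""
    return max(Counter(values).values())

# Hypothetical intensity profile with all values distinct.
profile = [0.10, 0.35, 0.62, 0.80, 0.97, 1.20]
```

Before quantization every value in `profile` is distinct, so `max_ambiguity(profile)` is 1; after `quantize(profile, 0.5)` three samples fall into the same bin. Because quantization is a many-to-one map, distinct values can merge but merged values can never separate again.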

Extension  to  unknown  bias  and  gain  settings 

What happens if we try to apply our analysis to functions which are not intrinsic to
the surface? For certain kinds of ambiguity or lack of calibration, we would like to
know that the data we get still allows the same uniqueness or degeneracy of match as
with an intrinsic function. As an example of such a situation, we analyze what happens
when gain and bias values are unknown. We have chosen this example because it is
commonly believed that the uncontrollability of these parameters is a major impediment
to intensity-based matching. We show that these extra degrees of freedom have no effect
on the degeneracy or uniqueness of the matching problem. The extra ambiguity does,
however, pose a greater challenge for a matching algorithm.

Before, we were concerned with the problem of Fig. (a): finding g_w such that F_1 = F_2 ∘ g_w.

If a measurement of the variable x yields the value ax + b, then a is called the gain and
b is called the bias.

Suppose that we observe the functions F_1 and F_2 as before, but now the bias and gain
settings may be different between the observations, so we must first correct for the different
settings before matching. This correction can be compressed into a single linear function,
giving the new situation shown in Fig. (a-bias).


The matching problem then becomes to find g_w such that (a F_2 + b) ∘ g_w = F_1 for some
a ∈ R, b ∈ R^n, and we are concerned with the question whether such a g_w is unique; i.e.
whether there exists some other g_w which makes the diagram commute for perhaps some
other values of a, b. Following the same reasoning as earlier, this is the same as asking for
a diffeomorphism h which makes Fig. (equiv-bias) commute for some values of a, b, c, d.

We now will prove

Theorem.  The  conclusions  of  the  2-color  theorem  remain  unchanged  even  for  unknown 
bias  and  gain  differences  between  pictures. 

If we try to find h only for the situation that c = 1 and d = 0, we have exactly the
problem we considered before, without gain or bias. So any h which satisfies our old
conditions will also work if there is gain and bias error, although of course there may be
even more h's for other values of c, d. Thus, the gain and bias matching problem is at
least as degenerate as we have proved earlier for the “pure” problem.

The situation of Fig. (a-bias) can also be represented as
where we have now included the unknown gain and bias parameters in a map


T : R^n → R^n
    y ↦ ay + b


(Incidentally, one can take a to be some n × n matrix, to allow for different gains in
different spectral bands, including linear crosstalk. In the absence of crosstalk, a is a
diagonal matrix.) Then the analog of Fig. (equiv-bias) is Fig. (equiv-bias').

Fig. (equiv-bias')


Fig. (equiv'-bias')


We see from this, as we earlier saw from Fig. (equiv), that the problem of uniqueness is
equivalent to finding h such that the diagram in Fig. (equiv'-bias') commutes for some
T. When T is the identity, we have the problem which we analyzed earlier, so we are
interested in what happens for nontrivial T. Using the same reasoning as before, we see
that a necessary and sufficient condition for commutativity is that ∀y ∈ R^n, h(F_1^{-1}(y)) =
F_1^{-1}(T(y)), which we can write h : F_1^{-1}(y) ↦ F_1^{-1}(T(y)). (Note that the inverse images
are sets, not points, as F_1 may not be 1-1.) But there isn't any guarantee that T(y) ∈
Range(F_1), even if y ∈ Range(F_1)! Since h is a diffeomorphism, whatever we say about h
also goes for h^{-1}, so the first prerequisite for the existence of h is that T(Range(F_1)) =
Range(F_1). Suppose that Range(F_1) is bounded. This must be so if K_1 is contained in a
compact set; and in any case any real image would have a bounded range of values. Then
it is easy to see that for scalar gain or no crosstalk, this cannot be the case for a nontrivial
T. For the more general case when T has cross terms and/or Range(F_1) is unbounded,
it seems likely that T and F_1 would have to be very special, and hence not generic, for
the range condition to hold. We illustrate the ready failure of the range condition for
monochrome images in the following figure:

The 2 humps represent a part of F_1, before and after the bias/gain change T. For the
value represented by the plane slice, there may well be no corresponding value in the
bias/gain-changed image.
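
For the scalar monochrome case the failure of the range condition can be written out explicitly. The short derivation below is our own illustration, under the assumptions that Range(F_1) is a closed interval [m, M] with m < M (the symbols m and M are ours) and that T(y) = ay + b with positive gain a:

```latex
\begin{align*}
T([m, M]) = [m, M],\quad a > 0
  &\implies a m + b = m \quad\text{and}\quad a M + b = M \\
  &\implies a\,(M - m) = M - m \\
  &\implies a = 1,\quad b = 0 .
\end{align*}
```

So under these assumptions the only bias/gain map preserving a bounded range is the identity. For a < 0 the same computation gives a = -1, b = m + M, i.e. the intensity reversal, so this argument relies on the gain being positive.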

Thus  we  have  shown  that  even  if  we  allow  for  the  possibility  that  there  may  be  unknown 
bias/gain  changes  between  corresponding  images,  so  that  we  are  forced  to  do  matching 
of  values  corrected  for  arbitrary  bias/gain,  our  results  remain  unchanged.  Furthermore, 
we have shown that for reasonable F's, T is unique; i.e., there is only 1 possible bias/gain
transformation which allows matching. QED

So far, though, we haven’t addressed the question of discovering the correct T. Let’s
consider only the monochrome case, so that the bias/gain parameters a, b are both scalars,
and assume that Range(F_1) is bounded. If the upper and lower bounds are known for
both F_1, F_2, then it’s clear that there is a unique T which takes corresponding bounds to
each other (assuming a > 0, i.e. one image is not a negative of the other). Unfortunately,
this would not work very well for real images, since noise and inconsistencies between
images might result in meaningless end points for the ranges. Ideally, we would want to
match topological features stably in the presence of noise, without the requirement for
finding the bias/gain relation independently.
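
The bound-matching construction is simple enough to state as code. This is a minimal sketch with a hypothetical function name; the direction of T (taking the range of the first image onto that of the second) is an assumption made for the example, and, as just noted, the method is only as good as the measured end points.

```python
def bias_gain_from_bounds(range1, range2):
    """Return (a, b) with a > 0 such that T(y) = a*y + b maps range1 onto range2.

    range1, range2: (lower, upper) bounds of the two images' value ranges.
    """
    lo1, hi1 = range1
    lo2, hi2 = range2
    if hi1 <= lo1 or hi2 <= lo2:
        raise ValueError("each range needs lower < upper")
    a = (hi2 - lo2) / (hi1 - lo1)   # gain: ratio of the range widths
    b = lo2 - a * lo1               # bias: aligns the lower bounds
    return a, b
```

For example, ranges (10, 60) and (25, 125) give gain 2 and bias 5. A single noisy extreme pixel in either image changes both parameters, which is the instability described above.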

Referring to Fig. (a-bias'), we can state the matching problem in the presence of noise as
follows. Looking for the best match means trying to find mappings g_w, T_t which optimize
the value σ(F_2 ∘ g_w, T_t ∘ F_1) of some similarity measure σ : C^r(K_1) × C^r(K_1) → R.
(Candidates: L^2 distance, cross-correlation, etc.) The measure should be chosen in such
a way that the optimal g_w, T_t are in fact the most probable, given information about
the statistics of the noise, the statistics of g_w and T_t, and the statistics of images, i.e.
some “probability” distribution on C^r(K_1). The quotes are there because it is a difficult
problem to define even a measure on an infinite-dimensional space; e.g. such spaces do
not admit translation invariant measures. Also the optimization must be carried out over
the infinite-dimensional space of diffeomorphisms between K_1 and K_2. This suggests the
use of a variational principle. Of course, simplifying assumptions can be (and are) made.
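
As an illustration of the candidate measures just mentioned, here is a discrete sketch for images sampled at corresponding points and stored as flat lists. The zero-mean normalized form of cross-correlation is our choice for the example, not something the text prescribes; it has the convenient property of being unchanged by a positive gain and arbitrary bias applied to either argument.

```python
import math

def l2_distance(u, v):
    """Discrete L2 distance between two equally sampled images (flat lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def normalized_cross_correlation(u, v):
    """Zero-mean normalized cross-correlation, in [-1, 1].

    Unchanged when either argument is replaced by a*x + b with a > 0,
    so it does not need the bias/gain map to be known in advance.
    """
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [x - mu for x in u]
    dv = [y - mv for y in v]
    denom = math.sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    if denom == 0:
        raise ValueError("constant image: correlation undefined")
    return sum(x * y for x, y in zip(du, dv)) / denom
```

Neither measure addresses the harder part of the problem, which is the search over the diffeomorphisms g_w; they only score a candidate once the correspondence has been applied.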

It’s  important  to  notice  that  the  addition  of  noise  does  not  change  the  applicability  of  our 
results. For if we are seeking the “true” g_w, i.e. the one which specifies the correspondence
between the unadulterated images, then any equivalent g that may exist as a consequence
of our results will match the identical unadulterated images, hence will be just as good as
g_w under any measure of the form of σ. Of course, the corruption of the noise may lead to
further  degeneracy.  One  expects  that  the  similarity  measure  can  be  designed  so  that  the 
matching  process  is  stable.  I.e.,  small  amounts  of  noise  should  lead  to  small  uncertainties 
in  the  match  (modulo  the  topological  ambiguities  we  have  shown),  and  sufficiently  small 
amounts  of  noise  should  not  disturb  topological  properties  of  the  solution. 

Rather than tackle this difficult problem now, we sketch a possible (though simple-minded)
way of finding T_t independently. In the presence of noise, some averaging method
is called for. Instead of trying to match only extreme values [note that finding T is a
matching problem in the range space], we might try to somehow match some kind of
average ranging over all values. Fig. (Range condition) suggests one approach, which
is reasonable if g_w is approximately an isometry (which is often the case in practice).
The idea is that for each y ∈ Range(F_1), the measure μ of F_1^{-1}(y) should be the
same as that of F_2^{-1}(T(y)). If we picture the slicing plane in Fig. (Range condition) as
moving up and down, then what we are saying is that corresponding slices in the 2 images
should have equal total arc length. We already know that generically, these slices will be
1-manifolds, so we are justified in using arc-length as our measure. Let S_i(y) be the total
arc length of F_i^{-1}(y), for i = 1, 2. Then we can plot S_1 and S_2 as functions of the real
variable y. Note that these are continuous, but will have discontinuities in derivative,
corresponding to critical values of F_i. The graphs will look something like those in Fig.
(Range).

Fig. (Range)


If T is monotonic (which is of course the case for the 2-scalar bias/gain), then our problem
is to match S_1 with S_2 by a linear map. In practice this would mean matching histograms
of gray values. Since the search space is only 2-dimensional, this could be done by a brute
force method. Alternatively, techniques exist for maximizing such matches, e.g. time-
warping. This technique, like histogramming methods in general, ignores topological and
geometrical relationships.
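
The brute-force search over the 2-dimensional (gain, bias) space can be sketched as follows. This is an illustrative sketch only: it compares normalized histograms of sampled gray values rather than arc lengths of level sets, uses an L1 histogram distance of our choosing, and searches a user-supplied grid; all function names are hypothetical.

```python
from collections import Counter

def histogram(values, bin_width=1.0):
    """Normalized histogram as {bin_index: fraction of samples}."""
    counts = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return {k: c / n for k, c in counts.items()}

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms."""
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in set(h1) | set(h2))

def brute_force_bias_gain(values1, values2, gains, biases, bin_width=1.0):
    """Search the 2-parameter grid for T(y) = a*y + b that best maps the
    gray-value histogram of image 1 onto that of image 2."""
    target = histogram(values2, bin_width)
    best = None
    for a in gains:
        for b in biases:
            mapped = histogram([a * v + b for v in values1], bin_width)
            d = histogram_distance(mapped, target)
            if best is None or d < best[0]:
                best = (d, a, b)   # keep the first strictly better candidate
    return best[1], best[2]
```

With gray values [0, 1, 1, 2, 3, 3, 3, 5] in one image and their images under y ↦ 2y + 4 in the other, a search over gains (1, 2, 3) and biases (0, 2, 4, 6) recovers gain 2 and bias 4. The positions of the pixels never enter the computation, so it inherits the blindness to topology and geometry noted above.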

Topological Invariants of the Picture Function

Introduction 

It is well-known that computer scientists are fond of graph-theoretic approaches, so it is
pleasing that the diffeomorphic invariants of (monochrome) pictures are well-represented
as  graph  and  tree  structures.  In  this  section,  our  goal  is  a  representation  of  the  picture 
topology  to  be  used  in  the  service  of  the  matching  problem.  This  requires  us  to  present 
the  applicable  theory  from  differential  topology,  adapt  it  to  our  purposes,  and  allows  us 
to  make  some  observations  along  the  way. 

We review the definition of Smale diagrams, and establish the topological properties of
level set trees. [Koenderink and van Doorn 1979] independently proposed level sets as
topological invariants, but put them to different use, mainly as means to compute features
of individual images, such as metrons and aperture spectrum. [Krakauer 1971] used a
related structure for experiments in image analysis. He was interested in characterizing the
shapes of the level sets at all image intensity values as a method of object classification. He
used measures like eccentricity, region area, and scatter diagrams in an effort to identify
various fruit, a fitting goal for a tree approach. He did not consider topological questions;
the work was an attempt at direct interpretation from region-based descriptions, with
little analysis of the nature of the image intensity function.

We, on the other hand, are concerned with the topology of the tree, its deformations and
bifurcations as we move through the space of pictures, and the use of the bifurcations
in handling noise. In general, we are concerned with using the level set tree as a
representation, considering its generic behavior over the entire class of pictures, and using
it in the matching problem. In a later section, we apply this structure to understanding
scale space [Witkin 1983]. We also make some observations about the stability of zero-
crossings. These applications are all new.

Smale diagrams and level set trees

We  can  formulate  the  general  matching  problem  as  a  finite  graph-matching  problem.  The 
nodes of the graph are to be the critical points of the image function F_1, and the (graph)
edges are the gradient paths between the critical points. As we saw earlier, Morse theory
tells us that generically the only critical points are maxima, minima, and saddles. More
formally, 

Definition. Let f ∈ C(M^m, R) be a Morse function. For each critical point p ∈ M^m,
define the stable and unstable manifolds, resp., of p as follows.


W^+(p) = {x ∈ M^m | the uphill gradient line of f leaving x converges to p}
W^-(p) = {x ∈ M^m | the downhill gradient line of f leaving x converges to p}


Define a partial order < (and hence a directed graph) among the critical points by q < p
if W^+(p) ∩ W^-(q) ≠ ∅.


Colloquially, q < p means you can get from p to q by going downhill along gradients (the
sense of the partial order is chosen to reflect f(q) < f(p)).

The Smale diagram [Smale 1967] of f is the ordered graph obtained by refining the
preceding partial order so that p → q if q < p and there is no r between p and q, i.e. such
that q < r < p. A generalization of the same idea is known as a Smale quiver [Abraham
and Marsden 1978].
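
The refinement step (keeping only the pairs with nothing in between) is a purely combinatorial operation once the order relation among critical points is known. A minimal sketch, assuming the relation is supplied as a transitively closed set of pairs; the function name and data layout are ours:

```python
def smale_edges(less_than):
    """Covering relation of a strict partial order on critical points.

    less_than: set of (q, p) pairs meaning q < p, assumed transitively closed.
    Returns the set of directed edges (p, q), read p -> q, such that q < p
    and no r satisfies q < r < p.
    """
    points = {x for pair in less_than for x in pair}
    edges = set()
    for q, p in less_than:
        # Keep the pair only if no critical point lies strictly between q and p.
        if not any((q, r) in less_than and (r, p) in less_than for r in points):
            edges.add((p, q))
    return edges
```

For a maximum M, a saddle s and a minimum m with m < s < M, the pair m < M is dropped and the diagram has the two edges M → s and s → m. Computing the order relation itself from image data, i.e. tracing gradient lines between critical points, is the hard part and is not attempted here.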
spatial extent of the ignored features. These are the 2 parameters which are linearly
combined  in  a  linear  smoothing  operation,  but  here  they  are  completely  separable,  and 
therefore  accessible  to  reasoning  machinery.  Each  subtree  has  its  own  characteristics, 
so  a  structure  like  this  can  be  made  as  “adaptive”  as  you  want;  e.g.  twiddles  on  very 
big  humps  might  not  mean  too  much,  while  the  same  twiddles  on  little  humps  might  be 
quite  important,  though  linear  measures  of  local  variation  could  be  identical.  This  really 
has a philosophical basis in the principle of least commitment and in the AI paradigm of
symbolic  (thus  nonlinear!)  reasoning. 
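
To make the idea of contracting subtrees concrete, here is a small sketch of a tree data structure in which the two significance parameters, intensity excursion and spatial support, are kept separate. The node layout, field names and thresholds are hypothetical, chosen only for illustration; a real level set tree would also have to record saddle relationships.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Node of a level set tree (hypothetical layout for illustration)."""
    height: float            # critical value at this node
    support: float           # area enclosed by the node's level curve
    children: List["Node"] = field(default_factory=list)

def leaf_heights(node):
    """Heights of all leaves (extrema) below `node`."""
    if not node.children:
        return [node.height]
    return [h for c in node.children for h in leaf_heights(c)]

def excursion(node):
    """Range of leaf heights in the subtree: the size of its ups and downs."""
    hs = leaf_heights(node)
    return max(hs) - min(hs)

def contract(node, min_excursion, min_support):
    """Collapse to a single node every subtree that is small in BOTH
    intensity excursion and spatial support; the two tests stay separate."""
    if node.children and excursion(node) < min_excursion and node.support < min_support:
        return Node(node.height, node.support)
    return Node(node.height, node.support,
                [contract(c, min_excursion, min_support) for c in node.children])

def size(node):
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)
```

Because the thresholds are passed in rather than fixed, a caller can vary them per subtree, which is one way to obtain the "adaptive" behavior described above; a linear smoothing operator offers no such separate control.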

Not  every  bifurcation,  and  therefore  not  every  smoothing,  amounts  to  lopping  off  a 
subtree.  As  we  saw  before,  we  can  also  pass  through  a  saddle  connection.  This  implies 
that  using  the  tree  data  structure  requires  some  added  sophistication,  viz.,  keeping  track 
of where saddle nodes are relative to each other. More generally, there must be a notion
of  measure  of  stability — how  far  it  is  (in  the  function  space)  to  a  bifurcation. 

We  have  described  all  the  generic  bifurcations  of  the  level  set  structure.  A  generic  scale 
space operator (i.e. a 1-parameter family of smoothers, whose t = 0 member is the
identity) can therefore have only these bifurcations. A particular operator, however,
might not have generic bifurcations; e.g. it might impose some special constraint that
only allows special behavior. E.g., never creating zero-crossings is not a generic property
for scale space smoothers (though it says nothing about the bifurcations of critical points).
We have been able to show, though, that the generic critical point bifurcations of Gaussian
scale space are not special, i.e. are the same as for generic perturbations, and are therefore
among those we have described [Blicher and Omohundro 1984]. It remains to study which
of these actually occur, and what are the unfoldings of the zero-crossing level set.

Now  we  can  make  some  comparisons  between  Gaussian  scale  space  and  the  level  set 
topology tree. We have seen that zero-crossings are not stable, e.g. near a saddle in
the picture. It is not clear whether the range of scales can fix this, for this depends on
the entire scale space ⊂ R^3, and codimension 1 manifolds are surfaces. The locus of
zero-crossings in scale space is one of the level sets of h.

To  understand  this  level  set  of  h  is  to  understand  zero-crossings  in  scale  space.  [Yuille 
and  Poggio  1983]  state  that  the  Gaussian  is  the  unique  convolution  kernel  which  does 
not create new zero-crossings with increasing t, under a number of regularity conditions.
[Babaud, Witkin, Duda 1983] and [Hummel and Gidas 1984] also study this question.

We  saw  that  the  level  set  topology  tree  is  a  stable  description  of  the  level  sets  of  /,  and  has 
simple, well-understood bifurcations. The nesting structure gives us an intrinsic, global
criterion of relative scale, for if node x nests in node y, we know that x is a “twiddle”
of y, and likewise for any sub-sub-nodes. Here’s a way to think of this. Consider Fig.
(topo) again. The saddle f is the 2nd type from the + column of Fig. (level). That means
that we can take the stuff that sits above it on say the left side, cut it off at the f saddle
level, and replace it by a simple cap. This could be done smoothly by some bifurcations
(critical point annihilations) inside that side of the figure-8 of f. This is smoothing; highly
nonlinear smoothing, however. Of course the exact single maximum cap that we get isn’t
uniquely defined, but why should it be? The picture doesn’t have one cap or another;
it has some complicated structure. The only justification for choosing a particular cap
would be that it was somehow special. In the tree structure, what this amounts to is
simply contracting a subtree to a single node. In a real picture, the tree structure is apt to
be quite complicated, with great numbers of nodes. There will be an enormous number of
ways to contract nodes to achieve grosser (smoother) representations. Actually, it may be
better to consider the problem not one of multiple representations, but one of intelligent
use of the single tree representation as a data structure. From that viewpoint, depth in
the tree corresponds to degree of detail. However, that information must include more
than nesting depth: there must be a measure of the significance of the ignored subtree.
This can come from 2 things - the size of the up and down excursions in the subtree
(i.e. the range of leaf heights), and the size of the support of the subtree nodes, i.e. the

Scale Space

In  our  discussion  of  the  works  of  Marr,  Hildreth,  Canny,  and  others,  we  saw  that  an 
important  problem  is  the  description  of  the  picture  at  various  scales.  Small  scales  have 
the  advantage  of  precision,  and  can  pick  out  small  features,  but  are  susceptible  to  noise. 
Large scales can see large features that aren't visible in a small peephole, and can have
good  noise  immunity  thanks  to  averaging,  but  large  linear  operators  confound  space  and 
intensity — they  blur  things. 

[Koenderink and van Doorn 1979] proposed that an aperture spectrum of an image be computed
by convolving with a 1-parameter family of window functions. The aperture spectrum is
the set of bifurcation values of the control parameter, in the usual parlance of bifurcation
theory. [Crowley 1982, Crowley and Parker 1984, Crowley and Stern 1984] searched for
some geometric features in data resulting from a sequence of convolutions with Gaussians,
but did not consider geometric or topological theory. [Witkin 1983] also convolved
with a 1-parameter family of Gaussians, and considered the bifurcations of zero-crossing
topology in the combined control-behavior space (see [Poston and Stewart 1978]), i.e. the
product space of the parameter and image, which he calls scale space. This is the usual
approach of bifurcation theory, but [Witkin 1983] did not consider topological theory.
The scale space operation is

h(t, x) = G_t(x) * f(x)

where f is the image, and G_t is a parametrized kernel. For scale space, a second derivative
operation is required, so either f is the Laplacian of the image and G_t is a family of
Gaussians, or f is the image and G_t is a family of Laplacians of Gaussians (x is a point
in the picture space R^n). Under these conditions, the object of interest is the locus of
zeroes of h. When (t, x) is a regular point of h, the inverse function theorem tells us
that h^{-1}(0) is of codimension 1 near (t, x). This allows tying together zero-crossings at
different scales, which was a major obstacle for many edge finders. For a picture on R^2,

critical point) along the section of the tree on which it lies. When 2 nodes in the tree come
together this way, giving 3 offspring of the combined node, we have a saddle connection.
Also, buds can form anywhere (but generically away from other nodes), creating critical
points, and leaf nodes can atrophy to nothing, annihilating critical points.

For a generic path like this, critical points can only be created or annihilated in pairs, for
it is easy to construct arbitrarily small perturbations which separate the critical points
into pairwise events. Recall that the Morse inequalities tell us that (with good behavior
at the boundary), the sum

\sum_{k=0}^{n} (-1)^k c_k

(c_k being the number of critical points of index k)
must remain unchanged, so new critical points can only be created or annihilated in
extremum-saddle pairs (the saddle-node bifurcation).
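
The invariance of the alternating sum of critical point counts can be checked numerically. What follows is a minimal sketch, not a procedure taken from this work. It assumes a periodic (torus) grid so that no boundary terms arise, a hypothetical picture and smooth noise term chosen only for illustration, and a piecewise-linear notion of critical point: a vertex is classified by how often its six ring neighbours switch between higher and lower. Under those assumptions the count of extrema minus saddles is 0 for every mixing amount t, so critical points can only appear or vanish in extremum-saddle pairs.

```python
import math

# Ring of six neighbours, in cyclic order, for a square grid triangulated
# by one diagonal per cell (an assumption made for this sketch).
RING = [(1, 0), (1, 1), (0, 1), (-1, 0), (-1, -1), (0, -1)]

def classify(f):
    """Count piecewise-linear critical points of f on a periodic grid."""
    n, m = len(f), len(f[0])
    counts = {"max": 0, "min": 0, "saddle": 0}
    for i in range(n):
        for j in range(m):
            here = (f[i][j], i, j)  # ties are broken by grid position
            higher = []
            for di, dj in RING:
                a, b = (i + di) % n, (j + dj) % m
                higher.append((f[a][b], a, b) > here)
            # Number of higher/lower switches met while walking round the ring.
            changes = sum(higher[k] != higher[k - 1] for k in range(6))
            if changes == 0:
                counts["min" if higher[0] else "max"] += 1
            elif changes >= 4:
                # 4 switches: ordinary saddle; 6: monkey saddle, counted twice.
                counts["saddle"] += changes // 2 - 1
    return counts

def picture(t, n=32):
    """Hypothetical f0 + t*N sampled on an n x n periodic grid."""
    w = 2 * math.pi / n
    return [[math.sin(w * i) + math.cos(w * j)
             + t * 0.8 * math.sin(3 * w * i + 1.0) * math.cos(2 * w * j + 0.5)
             for j in range(n)] for i in range(n)]

def euler_sum(counts):
    """Maxima minus saddles plus minima."""
    return counts["max"] - counts["saddle"] + counts["min"]

for t in (0.0, 0.5, 1.0):
    c = classify(picture(t))
    print(t, c, "alternating sum =", euler_sum(c))
```

The individual counts may change with t, but the alternating sum stays at the Euler characteristic of the torus, which is 0.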

In the presence of noise, then, the level set structures of the 2 images may not be the same.
They will differ by some sequence of the above bifurcations, which have simple effects on
the level set tree. Equivalently, the 2 images will be connected by a path in function space
which crosses some number of bifurcation frontiers. The matching problem can then be
reduced to a minimal path or optimal tree matching problem (with labelled trees). The
path to be minimised can be viewed as a sequence of level set topologies, equivalent to a
sequence of bifurcations, or as the path in the function space itself. We have not studied
the optimization criterion, but measures which could be taken into consideration are the
number of bifurcations and the size of perturbations (in view of knowledge about the
noise).

Occlusions  result  in  localized  but  large  differences  between  images.  Globally,  they  could 
be  handled  by  excision  and  pasting  of  tree  parts  (grafting?).  This  is  delicate,  however, 
since one must first study the global effects of excision and pasting. Another approach is
to use a number of local analyses, for example for a number of regions selected by bump
functions (which go from 1 to 0 smoothly to all orders in a finite space).

Fig. (saddle-connection)


The top part of Fig. (saddle-connection) shows a generic level set structure with 3 maxima.
We can smoothly increase the height of the lower saddle until it is precisely at the same
level as the other saddle, at which point the topology abruptly changes to the middle
picture, called a saddle connection. As we continue raising the level of the saddle, we
immediately get the bottom picture, which is again stable. A similar situation occurs
when  the  other  kind  of  saddle  is  involved. 

If  we  think  of  the  path  through  function  space  as  a  homotopy  of  maps  to  level  set  tree 
spaces, it is easy to visualise how the tree can change. Changing the height of a saddle
corresponds to sliding a saddle node (a node in the tree, not the same as a saddle-node

talk about what a single noise signal can do to the topology; we are not going to attempt
here a statistical study of these effects, so we will not consider, e.g., what the average
effect on topology will be. It is, however, possible to do such a study; the ensemble
properties must be considered over an appropriate space of control parameters, of course.
E.g., these could be taken as parameters in the equation of a smooth surface. Examples of
statistical yet topological studies of smooth functions are [Longuet-Higgins 1960], [Berry
1977], [Berry and Hannay 1977].

Think of a knob we can turn that gradually adds the smooth noise signal that has
contributed to the function we are observing; this corresponds to a path in function space
whose parameter is the amount the knob has been turned. Let t = 0 correspond to the
unadulterated picture, and t = 1 to the picture with the noise we are actually observing.
E.g., we could let f_N(t) = f_0 + tN, where f_0 is the unadulterated picture, N is the noise,
and f_N(t) is the adulterated picture at knob setting t. What happens to the level set
topology as we turn our knob? Since it is stable, the topology changes only at places where
the function is not generic. These changes are called bifurcations or catastrophes and are
completely classified by singularity, or so-called catastrophe, theory [e.g. Arnold 1984,
Poston and Stewart 1978, Chow and Hale 1982, Iooss and Joseph 1980], providing simple
rules which specify how the topology of f can (locally) change under such perturbations.
For our case, 1-parameter families on R^2, the situation is especially simple. Generically,
there are only 2 ways this process can change the level set topology:

• Passing through a saddle-connection,

• Creating or annihilating critical points.

zero-crossing contours. Of course, one wants ultimately to segment the image, and the
level set topology does not do that. We have argued that reliable segmentation can only
be done after first getting a qualitative global understanding of the picture function, the
type of understanding in which the level set topology is one element, and a beginning
one. We wouldn't advise, therefore, attempting segmentation at this stage. Nevertheless,
if one insists that it is approximately the zero values one is interested in (we do not
necessarily contend they are the correct thing to be interested in), then if one knows
which critical points have values near 0, there are a finite number of zero-crossing
topologies depending on whether any given critical point has a positive or negative value.
One could then use a constraint propagation procedure based on other information to
select a particularly interesting subset of topologies.

Noise and bifurcations

Let's come back to the problem of noise changing the level set topology or the Smale
diagram.

What kind of a model are we going to use for noise? We are working in the domain of
smooth functions, so we are going to take any noise signal to be a smooth function, in
keeping with the premise that the image irradiance is a smooth function. The statistical
analysis of noise involves computing integrals, so a natural setting for statistics is in L^2,
a space which contains mainly non-smooth functions. There are several reasons why we
are justified in nonetheless taking our noise signal to be smooth. While it is convenient
to do integration in L^2, physical signals are in reality bandlimited. Any imaging situation
is well-modelled by a process that includes convolution with a smooth kernel, e.g. a
Gaussian, i.e. it is impossible to avoid some amount of blurring. As we stated in detail
earlier in the section Edge Localization in Both θ and x of the chapter Contributions
to Edge Detection, a standard theorem [Lang 1969] tells us that the result of such a
convolution is as smooth as the kernel, even if the signal is only in L^1. We are going to

is evident in zero-crossing edge finders when the connectivity shown in Fig. (Conn-a) is
found in one image, while the connectivity of the corresponding region in the other image
is found to be as in Fig. (Conn-b).


Fig.  (Conn) 


Referring back to Fig. (saddle), it is easy to see how this can happen with only a
small amount of noise (or inexact mask-region correspondence between the images in
a convolution). Suppose that the function whose zero crossings one is seeking looks like
a saddle, with the critical value near 0. Then it is easy to imagine that in one image the
zero plane would slice the saddle a little below the critical value, while if the other image
has slightly smaller values, the slice would be above, yielding the grossly different (i.e.
topologically different) connectivity patterns.
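
The flip in connectivity is easy to reproduce. The sketch below is only an illustration under stated assumptions: the saddle is the hypothetical function x^2 - y^2 + c on the square [-1, 1]^2, the small offset c plays the role of the noise, and connectivity is measured by counting 4-connected components of the positive region on a sample grid. The function name `positive_components` and the grid size are choices made here, not anything taken from the text.

```python
def positive_components(c, half=20):
    """Count 4-connected components of {x^2 - y^2 + c > 0} sampled on [-1, 1]^2."""
    size = 2 * half + 1
    inside = [[((i - half) / half) ** 2 - ((j - half) / half) ** 2 + c > 0
               for j in range(size)] for i in range(size)]
    seen = [[False] * size for _ in range(size)]
    components = 0
    for i in range(size):
        for j in range(size):
            if not inside[i][j] or seen[i][j]:
                continue
            # New component found: flood-fill it with an explicit stack.
            components += 1
            seen[i][j] = True
            stack = [(i, j)]
            while stack:
                a, b = stack.pop()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    p, q = a + da, b + db
                    if (0 <= p < size and 0 <= q < size
                            and inside[p][q] and not seen[p][q]):
                        seen[p][q] = True
                        stack.append((p, q))
    return components

# Critical value slightly above and slightly below the zero plane.
print(positive_components(+0.01), positive_components(-0.01))
```

The function has exactly one saddle in both cases, so its critical point structure is the same; only the zero-crossing connectivity differs.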

We  stress  that  in  this  case,  even  though  the  zero-crossing  connectivity  is  unstable,  the 
level  set  topology  and  the  Smale  diagram  are  unchanged.  Topologically,  at  least,  the  level 
set tree and the Smale diagram are more robust representations of the function than the

nesting diagram only classifies the coarser space of level sets. For example, in Fig. (topo),
the 2 parts of the figure-8 of saddle d can be essentially interchanged by a 1-parameter
family of diffeomorphisms, viz., by shearing in a neighborhood of some level set just below
d, just as in the proof of the 2-color theorem. In other words, choose a regular (circular)
level set between d and f, grab the stuff above it, and rotate that stuff 180°, sliding along
the level set. That can be made smooth by shearing in a neighborhood and splicing with
a bump function. This doesn't change the level set topology, but it does interchange the
roles of the components of the figure-8 in the Smale diagram.

Stability 

The Smale diagram and the level set topology are stable for generic functions (i.e. Morse
functions), i.e. they do not change under small perturbations. That means they are
good ways to characterize the Morse functions, since the space of such functions is then
partitioned into open regions (the boundaries are non-generic). Notice that stability is
a criterion for robustness, in that it means that there is some latitude for error which
leaves the description unchanged. Right now we are interested in the level set structure
of f rather than the topology of ∇f, so we will only discuss the level set topology.
Unfortunately, the stability above, by itself, is not quite good enough for a practical
system. What "small perturbations" really means is that the diagram will not change
if the perturbation is small enough (values of derivatives are included in the measure).
The problem is that there is no guarantee that noise corrupting the images will be small
enough. Furthermore, for any given size of nontrivial noise, one can find a Morse function
with a critical point so delicate that the level set topology will be changed by the noise
(though not by ε times the noise for some ε, thus adhering to the stability theorem).
Thus, a node (extremum) and a saddle might be introduced.


.  not  even  chattel  ns  the  level  set  tonolotrv  or  Smale  diaeram. 


Instability of zero-crossings
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Fig.  (tree) 


Fig. (tree) shows 2 representations of the level set space for the image fragment in Fig.
(topo). The tree on the left is drawn to show the nesting structure, relative heights, and
extremum type of the critical points: the absolute height of each node in the tree is meant
to correspond to the value of its corresponding critical point in the image, and the arrow
is to be read "nests in." The right half of the figure shows just the bare tree, where a
subnode nests in its parent node. Node a is the global maximum, and ω is the global
minimum, in the following sense. For a compact manifold without boundary, a and ω
are always critical points, but when there is a boundary, they correspond to the maximal
and minimal closed level sets. This can be improved if the gradient is always transverse
to the boundary, or smoothing with bumps can allow extension to the sphere.

The important difference between the level set tree and the Smale diagram for functions
on 2-manifolds is this. Functions with the same level set topology can have different Smale
diagrams, for the Smale diagram classifies the gradient flow of the function, while the
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Fig.  (level) 


The generative rule is that any '+' can be replaced by anything from the '+' column, and
similarly for '-'. The symbols '+' and '-' represent a maximum and minimum, respectively. The
2 types of figure-8's are saddles (or more precisely, they are the separatrices associated
with the saddle at the crossing). This leads to a representation of the topology of f as a
tree, where the branch nodes are saddles and the leaves are extrema. In fact,

Lemma. As a topological space, the level set tree is homeomorphic to the space of level
sets. 

Proof.  The  topology  is  given  by  the  local  metric  induced  by  the  level  values.  It  is  easy 
to  check  that  this  is  well-defined. 

This structure is very similar to the Smale diagram, and the problem of matching 2
(monochrome) images is now equivalent to finding tree isomorphisms between the level
set topologies (which preserve the image values at the nodes).
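
As an illustration of what such a value-preserving tree isomorphism test might look like, here is a minimal sketch. The encoding is hypothetical: each node is a tuple (kind, value, children), the tree is treated as rooted and unordered, and image values are required to agree exactly, whereas a practical matcher would need a tolerance and would have to cope with the bifurcations discussed elsewhere in this chapter.

```python
def canonical(node):
    """Canonical form of a rooted, unordered, labelled tree.

    A node is (kind, value, children), where kind is "max", "min" or
    "saddle" and value is the image value at that critical point.
    Sorting the children's canonical forms removes the dependence on
    the order in which children happen to be listed.
    """
    kind, value, children = node
    return (kind, value, tuple(sorted(canonical(child) for child in children)))

def same_level_set_tree(a, b):
    """True if the two trees are isomorphic with matching kinds and values."""
    return canonical(a) == canonical(b)

# Two listings of the same structure, with children in different orders.
left = ("saddle", 5, [("max", 9, []),
                      ("saddle", 6, [("max", 8, []), ("max", 7, [])])])
right = ("saddle", 5, [("saddle", 6, [("max", 7, []), ("max", 8, [])]),
                       ("max", 9, [])])
# Same shape, but one node value differs.
other = ("saddle", 5, [("max", 9, []),
                       ("saddle", 6, [("max", 8, []), ("max", 6.5, [])])])

print(same_level_set_tree(left, right), same_level_set_tree(left, other))
```

Comparing canonical forms is only one way to decide isomorphism of rooted labelled trees; it is chosen here because it is short, not because the text prescribes it.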
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like  drainage  basins  forsakes).  The  basins  of  attraction  are  separated  by  1-dimensional, 
boundaries (like the continental divide). Clearly every non-critical point must be in a
basin or on such a boundary. If we turn things upside-down, the new basins of attraction
are now basins of repulsion for the right-side up picture. [Nackman 1982, Nackman 1984]
catalogs some of the behavior of functions on R^2 which can be deduced from the partial
order used to define the Smale diagram. (He calls this partial order the critical point
configuration graph). Some examples of the modern mathematical approach to these
features can be found in [Abraham and Shaw 1981, Gilmore 1981, Thom 1972, Hirsch
and Smale 1974, Abraham and Marsden 1978, Smale 1967].

The Smale diagram is a diffeomorphic invariant of the vector field ∇f. The matching
diffeomorphism g, however, carries with it the values of f, not of ∇f. And diffeomorphic
equivalence of f and h is not enough to guarantee diffeomorphic equivalence of ∇f and
∇h, except for sufficiently small perturbations. If g is not too extreme, though, the
problem of matching 2 (monochrome) images is equivalent to finding an isomorphism
between the Smale diagrams of the images.

Instead of considering the topology of the Smale diagram, which classifies the gradient
vector fields, we can consider the topology of the level sets of f. As [Koenderink and
van Doorn 1979] have observed, these level sets obey simple rules in their nesting,
which define a generative grammar. In fact, the only possible structures are shown in
Fig. (level).
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Fig.  (Smalc  diag) 

Fig. (topo) is an example of a level set structure one might find in an image. The partial
order we have defined among its critical points is shown in Fig. (partial order), and
the refinement to a Smale diagram is shown in Fig. (Smale diag). The dashed arrows
represent partial order relationships which might exist with other critical points if we had
extended the picture farther.

The entire topological structure of ∇f is given by its Smale diagram (possibly along
with some orientation information) [Peixoto 1973]. If we know the Smale diagram, then
we know how the critical points are connected, which lets us make deductions about
the topology of the level sets between them. E.g., each critical point has a basin of
attraction, the set of all points whose gradients eventually tend to that critical point (just

the behavior of the smoothing near saddles. The level set topology, however, is stable.
A case has been made that scale space allows tracking zero-crossings from coarser to
finer resolution. The level set tree requires no tracking; the coarseness is established
by depth in the tree (in the above sense) and the level sets at that depth are already
precisely located. Gaussian scale space contains metrical information, for the result at
a particular scale says something about extent. However, this metrical information is
confounded with intensity information, as we have seen, so it is of limited value. The
level set tree allows separating space and intensity. There is a double confounding of the
metrical information, actually, because if we lift the Gaussian kernel to the surface that
is projected to the picture, the nature of the kernel depends on the shape and orientation
of the surface. This means that a change in viewer position, e.g., will give different
results for the convolution, and while the zeroes of the Laplacian of the raw image are
invariant, the zeroes of the smoothed version are not. The level set topology, of course,
is invariant. Gaussian scale space is a particular class of bifurcations of the level set
topology, a particular set of paths through function space, and so a specialization of the
structure we are proposing. But the question is, why should the image values resulting
from this particular smoothing be special?
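
To make the idea of coarsening by contracting subtrees concrete, here is a minimal sketch under loudly stated assumptions. A node is a hypothetical pair (value, children); significance is measured only by the range of leaf heights in a subtree (the size of the support is ignored); and a contracted subtree is replaced by a single leaf carrying the value at which it was cut off. None of these choices is fixed by the text, which leaves the replacement cap deliberately undetermined.

```python
def leaf_values(node):
    """Heights of all leaves (extrema) in the subtree rooted at node."""
    value, children = node
    if not children:
        return [value]
    return [v for child in children for v in leaf_values(child)]

def contract(node, min_range):
    """Contract every subtree whose leaf heights span less than min_range.

    The contracted subtree becomes one leaf that keeps the value of the
    subtree's root, i.e. the level at which the subtree was cut off.
    """
    value, children = node
    if not children:
        return node
    heights = leaf_values(node)
    if max(heights) - min(heights) < min_range:
        return (value, [])
    return (value, [contract(child, min_range) for child in children])

# Root at level 2 with one tall peak and a pair of nearly equal small peaks.
tree = (2, [(9, []), (5, [(6.0, []), (6.2, [])])])

coarse = contract(tree, 1.0)   # the pair of small peaks is merged
fine = contract(tree, 0.1)     # nothing is insignificant at this threshold
print(coarse)
print(fine)
```

No tracking across scales is involved: the same tree is simply read at a coarser or finer threshold.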
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Motion,  Optic  Flew,  and  Lie  Algebras* 

Introduction 

For the past several years, many researchers have been investigating problems of moving
objects and observers (see e.g., [Tsai and Huang 1984], [Prazdny 1981], [Prazdny 1983],
[Buxton and Buxton 1983], [Nagel 1983], [Horn and Schunck 1980], [Tsai 1983a], [Tsai
1983b], [Prazdny 1980], [Bruss and Horn 1983], [Ullman 1979]). The paradigm of this
research is based on the fact that a point moving in space projects to a point moving
in the picture. The problem is then usually approached in 2 steps. First, to find the
motion in the picture, the optical flow, you find corresponding points in 2 or more frames.
Then, given this set of correspondences, either for a few or for many points, you solve
some set of equations which yields the motion in space. These 2 subproblems have
generally been approached separately; thus there are 2 classes of results: how to match
points (correspondence), and how to compute motion from matches (e.g. how many
corresponding points it takes). The correspondence problem, unfortunately, is subject
to degeneracies, as we have shown above. E.g. at a single point, the image function
and its time derivative tell us nothing about motion perpendicular to the gradient of
the image function. If possible, then, it would be better to consider the problem as a
whole, and avoid new difficulties created by a particular choice of subproblems, for, as
we showed above, the correspondence problem is much harder without knowledge about
the 3-dimensional changes that underlie the differences between pictures.

All the information which we have about the scene is contained in the time-varying
picture, which is a function on some 2-dimensional space, as we said in more detail earlier.
Our final goal is to deduce the shape, position, and motion of the 3-dimensional objects
that give rise to this function. We want to approach this by looking only at the function
itself, i.e. the time-varying image, and without the constraint that our intellectual path
go by way of first finding some point motions in the plane.

*This work was done with the collaboration of Stephen M. Omohundro.

The  situation  is  this.  Some  rigid  object  is  moving  in  space.  Our  imaging  of  it  gives  us 
a  function,  the  image  intensity  function,  which  undergoes  continuous  distortions  most 
everywhere.  These  distortions  are  a  result  of  the  motion  and  of  the  shape  and  position 
of the surface. The problem is to separate and quantify the sources of what we see.

The  whole  time  course  of  the  image  has  a  vast  amount  of  information  in  it,  so  it  is  easier 
to  consider  only  parts  of  the  information  at  once.  One  can  look  at  what  is  happening 
over  whole  chunks  of  time,  or  only  at  a  single  instant.  For  differentiable  situations,  the 
differential  theory  is  usually  the  easier,  transforming  nonlinear  problems  into  linear  ones, 
so  that  is  where  we  start. 

We prove some new theorems establishing how much picture information is necessary
and sufficient to specify object motion. An important feature is that we do not assume
that we can track individual points in the image, nor that we are given any of their
velocities (i.e., the optic flow). The major result is the 6 point df/dt theorem, showing
that generically* the values of df/dt at 6 points of the monochrome image f are necessary
and sufficient to specify the motion of a given object. If we add color, we find that for 2
or more color dimensions, df/dt need only be known at 3 non-collinear points. Also, for
2 or more color dimensions, the optic flow is generically uniquely specified, in contrast to
the monochrome case, where there is a 1-dimensional degeneracy.

We are going to use the notation of modern abstract geometry: Lie groups, Lie algebras,
tangent planes, vector fields and bundles, etc. This lets us say things very compactly
and simply, once the definitions are understood. Everything could also have been done
without these abstractions (except maybe the use of genericity), solely in the language
of classical calculus: vectors, rotation matrices, coordinate systems, etc., just as any
computer program can be written in machine language, using absolute addresses. It is
easier, though, to understand one written in a more abstract notation, especially if you
don't happen to be its author. Instead of a maze of calculations, the reader is presented
with simple (but rigorous) descriptions. Abstract mathematical treatment actually does
more: it lets you understand a whole class of problems at once. Incidentally, this is really
more than just analogy; the process of specifying concrete objects for abstractions can be
automated into a compilation, so abstract notation can actually be used as a high-level
programming language.

*Recall from our discussion of the 2-Color Theorem earlier in this chapter that a generic property is one
which is true for a typical element of a space, i.e. for a very dense subset of the space. For this section,
we take this to mean an open dense subset.


The  situation  is  again  that  of  Fig.  (*'),  except  now  the  nature  of  the  transformation  g 
will  be  paramount. 


Fig.  (*') 

We are interested in rigid motions in R^3, so g ∈ E(3). The time evolution of the motion
is then given by

\gamma : \mathbb{R} \to E(3)

i.e., as a path in the transformation group. In fact, γ defines a 1-parameter family of
transformations. Since we are interested only in small changes from the current state, we
take γ(0) = I, the identity in E(3) (we could have done this anyway by using the group
structure to translate back to the identity). For every t, γ(t) gives a rigid motion of R^3,
since we are identifying E(3) with the rigid motions of R^3:


\gamma(t) : \mathbb{R}^3 \to \mathbb{R}^3


Each point of R^3 is carried along with this motion, and describes a path in R^3. In
particular, every point of our surface of interest, embedded in R^3, has such a path. Now
apply the imaging projection, and restrict attention only to the visible surface of the
embedded object. By composition, this leads to a path through each point that gets hit
in the image. Now consider only a single time, t = 0. The structure we have presented
thus far is summarised in Fig. (flow).
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Fig.  (flow) 


Each such path in the picture has a velocity vector, and each point in the image has a
path, so there is a vector field defined on the image. This is what is usually referred to as
the optic flow, though it would be more consistent with mathematical terminology to call
its integral, i.e. the paths in the image, the optic flow. We will reserve the term optic flow
for this integral, i.e. the map φ_t : U → R^2 which specifies the paths of corresponding
points in the picture with initial points in the region U, while using optic velocity field or
optic vector field for its instantaneous velocities, the vectors dφ_t/dt. Similarly, the paths
in R^3 define a vector field on R^3, and the path γ in E(3) defines a tangent vector at the
identity in E(3).
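
The chain from a tangent vector at the identity of E(3) to the optic velocity field can be written out in coordinates. The sketch below rests on assumptions that the text does not fix: the tangent vector is represented by an angular velocity `omega` and a linear velocity `u`, so that a point X moves with velocity omega × X + u, and the imaging projection is taken to be the pinhole map (X, Y, Z) -> (X/Z, Y/Z). All function names are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def point_velocity(omega, u, X):
    """Velocity in R^3 of the point X under the infinitesimal rigid motion (omega, u)."""
    w = cross(omega, X)
    return (w[0] + u[0], w[1] + u[1], w[2] + u[2])

def optic_velocity(omega, u, X):
    """Image velocity of the projection of X under (X, Y, Z) -> (X/Z, Y/Z).

    Obtained by differentiating the projection along the path of X:
    d/dt (X/Z) = (X' Z - X Z') / Z^2, and likewise for Y/Z.
    """
    Xd, Yd, Zd = point_velocity(omega, u, X)
    x, y, z = X
    return ((Xd * z - x * Zd) / z ** 2, (Yd * z - y * Zd) / z ** 2)

# Pure translation along the optical axis, then pure rotation about it.
print(optic_velocity((0, 0, 0), (0, 0, 1), (1, 2, 4)))
print(optic_velocity((0, 0, 1), (0, 0, 0), (1, 0, 2)))
```

Both maps are linear in (omega, u), which is the property the differential theory exploits.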




Fig. (vector fields)


The available data, however, is not the optical flow or vector field, but the time-varying
picture function f_t, which is just the projection of the intrinsic surface function F, under
the same approximations we used in choosing a mathematical structure at the beginning
of this chapter. Since we are considering only the differential theory, we regard our data
as telling us only the instantaneous value f_0, and all the time derivatives at t = 0. This
is the same as knowing the Taylor series for f_t. We will only use the 1st derivative for
now. At a point p of the image, call the optic flow vector v. Then in a frame with velocity
v at p in the image, f_t does not appear to change; the optic flow specifies the motion of
corresponding points. Thus if we leave the frame fixed, we see that

\frac{d}{dt} f_t(p) = -D_v f_t(p),

where D_v means differentiation by the vector v, equivalent to v · ∇, so that

\frac{d}{dt} f_t(p) = -v \cdot \nabla f_t(p), \qquad (*)

(The more formal version of this theorem can be found on p. 91 of [Abraham and Marsden
1978], and was stated by Marius Sophus Lie in 1890. It is well-known in the context of
optic flow; see e.g. [Horn and Schunck 1980, Ballard and Brown 1982].) Equation (*)
shows how it is that we only have partial information about v: we only know 1 component.
We can immediately see, also, that if f had multiple dimensions, i.e. if there were more
than 1 color dimension, we would have information about multiple components, and v
would be uniquely determined for generic f. This is the differential version of the 2-color
theorem we proved earlier. Finding optic flow, like matching, is much easier with color.
We formalize this in

Theorem. (2-color theorem for optic flow) For a generic time-varying image function
f_t : R^2 → R^n, the optic flow vector is uniquely specified at a generic point of the image
if n ≥ 2, i.e. for 2 or more color dimensions.
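
As a small illustration of the theorem, equation (*) applied to two color channels gives two
linear equations for the two components of v. The sketch below solves that 2 x 2 system by
Cramer's rule. The function name and argument layout are assumptions made for this example
and are not notation from the text.

```python
def flow_from_two_channels(grad1, grad2, dfdt1, dfdt2, eps=1e-12):
    """Solve equation (*) for two color channels at one image point.

    grad1, grad2: spatial gradients (gx, gy) of the two channels at p.
    dfdt1, dfdt2: time derivatives of the two channels at p.
    Returns the optic flow v = (vx, vy) satisfying grad_k . v = -dfdt_k.
    """
    a, b = grad1
    c, d = grad2
    det = a * d - b * c
    if abs(det) < eps:
        # Parallel (or zero) gradients: only one component of v is visible,
        # which is the monochrome situation described in the text.
        raise ValueError("gradients are linearly dependent; v is not determined")
    r1, r2 = -dfdt1, -dfdt2
    vx = (r1 * d - b * r2) / det
    vy = (a * r2 - c * r1) / det
    return vx, vy


# Synthetic check: choose a flow, generate the time derivatives it would
# produce through equation (*), and recover the flow from them.
true_v = (2.0, -1.0)
g1, g2 = (1.0, 0.0), (1.0, 1.0)
dt1 = -(g1[0] * true_v[0] + g1[1] * true_v[1])
dt2 = -(g2[0] * true_v[0] + g2[1] * true_v[1])
recovered = flow_from_two_channels(g1, g2, dt1, dt2)
```

With a single channel only the first equation is available, so v is known only up to an
arbitrary component perpendicular to the gradient.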

When we fix t = 0, each side of equation (*) is just a number, so for each p we have a
map

D_v(f)(p) : v ↦ a real number

We have thus defined a string of linear mappings (v.f. stands for vector field, v.b. for
vector bundle):

tangent vector on E(3) ↦ v.b. section on object
    ↦ v.f. on image ↦ vector at p ↦ real number

(We must consider sections of a vector bundle on the object rather than vector fields
(sections of the tangent bundle) because the vectors we are interested in are tangent
vectors to paths in R^3 going through points of the object. Since the paths generally do
not lie in the object, their tangent vectors needn't be in the tangent space of the object,
but rather are merely tangent vectors in R^3.)

A 1-form is a map which takes a vector field and spews out a scalar field, linearly at each point. I.e.
at each point it linearly maps vectors to numbers. Thus it is dual to the notion of a vector field. A
function f has a canonical 1-form, df, associated with it by looking at how the function changes along
paths. Consider a vector v at the point p. To define df at p, we must specify a number to which it will
send v. v can be thought of as the tangent vector to some path, say γ : I → M, so that v = γ'(0). Then
we can define df by

$$df(v) = \left. \frac{d}{dt} \right|_{t=0} f(\gamma(t)).$$

df is sometimes called the differential of f. The space of all tangent vectors at a point is called the
tangent space. A linear map from a vector space to the reals is called a dual vector, and the space of
such maps, the dual space of the original vector space. The dual space of the tangent space is called the
cotangent space, and its elements covectors. The disjoint union of the tangent spaces at all the points of
a manifold is called the tangent bundle, and that of cotangent spaces, the cotangent bundle. The manifold
that the vectors were originally tangent to is called the base space. Both bundles have natural structures
as manifolds of dimension double that of the base space. A map which assigns to each point of the
base manifold an element of its (co-)tangent space at that point is called a section of the bundle. In the
context of bundles, the fiber over a point is the (co-)tangent space of the original manifold at that point.
A section chooses a point in each fiber. The tangent space at a point p of M is written T_pM, the tangent
bundle TM, the cotangent space at p is T*_pM, and the cotangent bundle T*M. Thus df is a section of
the cotangent bundle T*M. ∇f, however, is a section of the tangent bundle, since it is vector-valued. It
can only be defined if there is a canonical isomorphism between the tangent and cotangent bundles, e.g.
if a metric is defined, or equivalently, a dot product. We will be confining our attention mainly to df.
Instead of using tangent spaces to make a bundle, we can replace the role of the tangent space with an
arbitrary vector space, yielding a vector bundle.

A Lie group is a manifold which also has a group structure such that the group operation is a smooth
map. Examples are matrices of nonzero determinant, and rotation groups. Like any other manifold, a Lie
group has a tangent space at each point. Because of the group structure, though, vectors at the identity
element of the Lie group can be moved around the manifold by the group action, so it is enough for most
purposes to consider only the tangent space at the identity. This space is called the Lie algebra
associated with the Lie group. It is an algebra because in addition to the vector space structure, there
is a multiplication, called the Lie bracket. The bracket measures what the Lie group does to one vector
as it moves it along in the direction specified by the other vector. The Lie algebra captures the
infinitesimal behavior of its associated Lie group.


The Lie algebra g of a Lie group G is a vector space which can be identified with the
tangent space of G at the identity. E(3) is a Lie group, and therefore associated with it
is the Lie algebra e(3); and since E(3) is a 6-dimensional manifold, e(3) is a 6-dimensional
vector space. The tangent vector γ'(0), which is the instantaneous motion, can therefore
be thought of as an element of the Lie algebra e(3).

We can do this for every path γ, hence for every element of e(3), giving us a homomorphism
from the Lie algebra e(3) to sections of the vector bundle on the object, and likewise
again to a Lie algebra of vector fields on the image of the object in the image plane. The
composition of these is a Lie algebra homomorphism. The sequence of linear maps can
therefore be written

Lie algebra e(3) → v.b. sections on object
    → v.f.'s on image → vectors at p → real numbers

This defines a map e(3) → R, i.e. an element of e*(3), the dual of e(3).

Now we have enough machinery to attack some questions. The first question is whether
there is enough information in ∂f/∂t to uniquely specify the instantaneous motion, for
generic f. The instantaneous motion is an element of e(3). As we just saw, for each point
p of the image, the geometry defines an element of e*(3). The question then becomes
whether we can span all of e*(3) by ranging over all points of the image, for knowing the
values obtained by applying a dual basis in e*(3) uniquely specifies the original vector in
e(3). e*(3) is 6-dimensional, so if this is possible, it is possible for 6 points corresponding
to a dual basis. This doesn't say anything yet about finding the shape or position of the
object; we only want to know whether we can recover the motion for fixed shape and position.
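
The spanning question can be explored numerically. The sketch below is not from the text. It
assumes a pinhole camera at the origin with unit focal length, a known depth Z at each image
point, and an instantaneous rigid motion written as (t, w) with object point velocity
V = t + w x P. Under those assumptions each image point and gradient give one row of six
numbers, an element of e*(3), and the question is whether six such rows have rank 6. All
function names are hypothetical.

```python
def motion_field_rows(x, y, depth):
    """Rows of the 2x6 matrix sending (tx, ty, tz, wx, wy, wz) to the image
    velocity (u, v) at image point (x, y) = (X/Z, Y/Z), with Z = depth.

    Assumed model: pinhole camera at the origin, unit focal length, and
    object point velocity V = t + w x P.
    """
    inv = 1.0 / depth
    u_row = [inv, 0.0, -x * inv, -x * y, 1.0 + x * x, -y]
    v_row = [0.0, inv, -y * inv, -(1.0 + y * y), x * y, x]
    return u_row, v_row


def covector(x, y, depth, grad):
    """Element of e*(3) defined by one image point: by equation (*) it sends
    a motion to df/dt = -grad . (u, v)."""
    u_row, v_row = motion_field_rows(x, y, depth)
    gx, gy = grad
    return [-(gx * a + gy * b) for a, b in zip(u_row, v_row)]


def rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [list(row) for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rk = 0
    for col in range(n_cols):
        if rk == n_rows:
            break
        pivot = max(range(rk, n_rows), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, n_rows):
            factor = m[i][col] / m[rk][col]
            for j in range(col, n_cols):
                m[i][j] -= factor * m[rk][j]
        rk += 1
    return rk


def spans_dual(measurements):
    """measurements: list of (x, y, depth, grad). True if they span e*(3)."""
    return rank([covector(*m) for m in measurements]) == 6


# Two independent covectors over each of three noncollinear points.
three_points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0)]
two_per_point = [(x, y, z, g) for (x, y, z) in three_points
                 for g in ((1.0, 0.0), (0.0, 1.0))]

# Six covectors over a single point: these can never span.
single_point = [(1.0, 0.0, 2.0, g) for g in
                ((1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
                 (1.0, -1.0), (2.0, 1.0), (1.0, 2.0))]

# One covector at each of six distinct points (arbitrary illustrative values).
six_points = [(0.1, 0.2, 2.0, (1.0, 0.3)), (-0.4, 0.5, 3.1, (0.2, -1.0)),
              (0.7, -0.3, 2.6, (-0.8, 0.5)), (-0.6, -0.7, 4.2, (0.9, 0.9)),
              (0.3, 0.8, 3.7, (-0.1, 0.6)), (0.9, 0.1, 2.2, (0.4, -0.7))]
six_point_rank = rank([covector(*m) for m in six_points])
```

All covectors over one image point are combinations of the same two rows, so measurements
confined to a single point can have rank at most 2, however many gradients are used.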

Theorem (6 point ∂f/∂t theorem). Let

$$f : I \times U \to \mathbb{R}, \qquad (t, p) \mapsto f(t, p)$$

be a time-varying picture for some time interval I around 0, and some neighborhood U in
the image plane of regular values of the imaging projection of some 2-dimensional object
embedded in R^3. If f comes from the projection of a generic intrinsic function on an
object undergoing rigid motion in R^3, then the values of

$$\frac{\partial f}{\partial t}(0, p)$$

at 6 generic points p ∈ U are necessary and sufficient to uniquely specify the instantaneous
motion of the object.

Proof. We are in effect measuring the optic velocity field with our image function; this
is what equation (*) says. To be able to tell the difference between different elements of
e(3), i.e. different motions, the mapping from e(3) to velocity fields on the picture must
be 1-1. Since the mapping is a vector space homomorphism, this is the same as saying it
has no (nontrivial) kernel. The homomorphism

e(3) → v.b. sections on object

has no kernel, because any kernel would leave the entire object fixed, but a rigid motion
of R^3 can leave at most a line fixed. So e(3) is mapped 1-1 to sections of bundles on the
object. Now we must show that the kernel of the homomorphism

v.b. sections on object → v.f.'s on image

doesn't contain anything that comes from the previous map from e(3). The kernel of
the current map is just the sections whose vectors lie along the rays of projection to the
picture. For orthogonal projection, vertical translation (translation along the direction of
projection) would of course be in this kernel, but we are assuming a projective projection,
i.e. that the rays all meet at a point; for a planar retina this is the usual perspective
projection.


Fig. (kernel-rays)

We have to show that any such motion, where points move only along rays, cannot come
from a rigid motion. This is easy to see; take 3 points on the object not all on the same
line:

Fig. (3 points)

Since a rigid motion of R^3 can only leave a single line (an axis) or nothing fixed, at least 1
of the points must move, say a. If a moves down along its ray, b must move up, to keep their
distance constant (rigid motion). Since b is moving up, c must move down. But then a and c
are both moving down and therefore narrowing their distance, showing that the motion
cannot be a rigid motion, i.e. the kernel of


v.b. sections on object → v.f.'s on image


is not in the image of


e(3) → v.b. sections on object


(except  for  0,  of  course). 
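
The up/down argument can also be written as a small linear system. The following is a sketch
under the assumption that the center of projection is at the origin, so that a velocity
along the ray through the object point P_i has the form λ_i P_i. The symbols λ_i and a_ij
are introduced here for illustration and do not appear in the text.

```latex
% Velocities along rays at three object points P_a, P_b, P_c:
V_i = \lambda_i P_i, \qquad i \in \{a, b, c\}.

% An instantaneous rigid motion preserves all pairwise distances:
(V_i - V_j) \cdot (P_i - P_j) = 0
\;\Longrightarrow\;
\lambda_i\, a_{ij} + \lambda_j\, a_{ji} = 0,
\qquad a_{ij} := P_i \cdot (P_i - P_j).

% The three pairs (a,b), (b,c), (c,a) give a homogeneous system:
\begin{pmatrix}
a_{ab} & a_{ba} & 0 \\
0 & a_{bc} & a_{cb} \\
a_{ac} & 0 & a_{ca}
\end{pmatrix}
\begin{pmatrix} \lambda_a \\ \lambda_b \\ \lambda_c \end{pmatrix} = 0,
\qquad
\det = a_{ab}\, a_{bc}\, a_{ca} + a_{ba}\, a_{cb}\, a_{ac}.
```

When the determinant is nonzero the only solution is λ_a = λ_b = λ_c = 0, so all three
points are fixed, and three noncollinear fixed points force the instantaneous rigid motion to
be zero. The determinant is a polynomial in the coordinates of the points that is not
identically zero, so it vanishes only for non-generic configurations. For example, if the
three points are equidistant from the center of projection, every a_ij is positive and both
products in the determinant are positive.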

So we know that the composition


e(3) → v.f.'s on image

has no kernel, i.e. is 1-1. This means that every rigid motion gives a unique optic velocity
field, and the vector space of such fields is 6-dimensional.

Actually, we showed more than that. We showed that a generic set of 3 points cannot
stay fixed in the image; we didn't even have to consider the whole vector field. The set
of vectors at 3 such points in the image make up a 6-dimensional vector space, so what
we showed is that the map

e(3) → vectors at 3 given points in image

has no kernel, i.e. is 1-1.

That means that to specify a motion, i.e. an element of e(3), we only have to figure out
the optic velocity vectors at 3 points. A generic function, via equation (*), tells us 1
component of each of the vectors (by genericity, the gradient is nonzero at all 3 points).
If we had 2 generic functions, then we could recover both components of each of the 3
vectors by using equation (*) for both functions (generically, the gradients will be linearly
independent, i.e. in different directions at the 3 points). Parenthetically, we have just
proved

Corollary (2 colors, 3 points). For generic f taking values in 2 or more color dimensions,
the values of ∂f/∂t(0,p) at 3 noncollinear points p ∈ U are necessary and sufficient to
uniquely specify the instantaneous motion of the object.
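
The corollary can be illustrated with a short computation. This is a sketch rather than
anything given in the text. It assumes a pinhole camera at the origin with unit focal
length, known depths at the 3 image points, and object point velocity V = t + w x P. With
2 color channels the full optic velocity (u, v) is available at each point, so 3 points give
6 linear equations in the 6 unknowns (tx, ty, tz, wx, wy, wz). The helper names are
hypothetical.

```python
def motion_field_rows(x, y, depth):
    """Rows sending (tx, ty, tz, wx, wy, wz) to image velocity (u, v) at
    image point (x, y), for the assumed unit-focal-length pinhole model."""
    inv = 1.0 / depth
    u_row = [inv, 0.0, -x * inv, -x * y, 1.0 + x * x, -y]
    v_row = [0.0, inv, -y * inv, -(1.0 + y * y), x * y, x]
    return u_row, v_row


def solve(a, b, eps=1e-12):
    """Solve the square system a m = b by Gauss-Jordan elimination with
    partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < eps:
            raise ValueError("singular system: points do not determine the motion")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                for j in range(col, n + 1):
                    aug[i][j] -= factor * aug[col][j]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def motion_from_three_flows(points, flows):
    """points: three (x, y, depth); flows: three (u, v).
    Returns the instantaneous motion (tx, ty, tz, wx, wy, wz)."""
    rows, rhs = [], []
    for (x, y, depth), (u, v) in zip(points, flows):
        u_row, v_row = motion_field_rows(x, y, depth)
        rows += [u_row, v_row]
        rhs += [u, v]
    return solve(rows, rhs)


# Synthetic round trip: generate flows from a known motion, then recover it.
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0)]
true_motion = [0.1, -0.2, 0.3, 0.01, 0.02, -0.03]
flows = []
for x, y, depth in points:
    u_row, v_row = motion_field_rows(x, y, depth)
    flows.append((sum(a * m for a, m in zip(u_row, true_motion)),
                  sum(a * m for a, m in zip(v_row, true_motion))))
recovered = motion_from_three_flows(points, flows)
```

If the three points coincide, the six equations collapse to two and the solver reports a
singular system, which is the degenerate case excluded by the hypotheses of the corollary.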

Now we must show that 1 component at each of 6 points is as good as 2 components at
each of 3 points.

We saw earlier that df defines an element of e*(3). Thus the geometry defines a map


References 




[Duda and Hart 1973]

Duda, R.O. and P.E. Hart, Pattern Classification and Scene Analysis,
Wiley, New York, 1973.

(cited on p. 14,15,19,41,81)

[Fennema and Brice 1970]

Fennema, C.L. and C.R. Brice, “Scene analysis of pictures using regions,”
Artificial Intelligence Journal 1, 1970, 205-226.

(cited on p. 21,85)

[Gennery 1977]

Gennery, Donald B., “A Stereo Vision System for an Autonomous Vehicle,”
Proceedings of the 5th International Joint Conference on Artificial Intelligence,
MIT, Cambridge, Massachusetts, August 1977, 576-582.

(cited on p. 178,179)

[Gennery 1980]

Gennery, Donald B., “Modelling the Environment of an Exploring Vehicle
by Means of Stereo Vision,” Ph.D. thesis, Stanford Artificial Intelligence
Laboratory, AIM 339, June 1980.

(cited on p. 154,178,179)

[Gilmore 1981]

Gilmore, Robert, Catastrophe Theory for Scientists and Engineers,
Wiley-Interscience, New York, 1981. [QA614.58.G54, ISBN 0-471-05064-4].

(cited on p. 193)

[Golubitsky and Guillemin 1973]

Golubitsky, M. and V. Guillemin, Stable Mappings and their Singularities,
(Graduate Texts in Mathematics 14), Springer, New York, 1973. [QA613.64.G64,




March  1955. 

(cited  on  p.  9) 

[do Carmo 1976]

do Carmo, M.P., Differential Geometry of Curves and Surfaces, Prentice-Hall,
Englewood Cliffs, N.J., 1976.

(cited on p. 45)

[Dreschler and Nagel 1981a]

Dreschler, L. and H.-H. Nagel, “Volumetric Model and 3D-Trajectory of a
Moving Car Derived from Monocular TV-Frame Sequences of a Street Scene,”
Report IfI-HH-M-90/81, Fachbereich Informatik, Universität Hamburg.

(cited on p. 18,48)

[Dreschler and Nagel 1981b]

Dreschler, L. and H.-H. Nagel, “Volumetric Model and 3D-Trajectory of
a Moving Car Derived from Monocular TV-Frame Sequences of a Street
Scene,” Proceedings of the Seventh International Joint Conference on Artificial
Intelligence (IJCAI-81), August 1981, Vancouver.

(cited on p. 18,48)

[Duda and Hart 1971]

Duda, R.O. and P.E. Hart, “A Generalized Hough Transformation for Detecting
Lines in Pictures,” SRI AI Group Tech Note 36, 1971.

(cited on p. 19,81)

[Duda and Hart 1972]

Duda, R.O. and P.E. Hart, “Use of the Hough Transformation to Detect Lines
and Curves in Pictures,” Comm. ACM 15, no. 1, 1972, 11-15.

(cited on p. 19,81)




[Crowley 1982]

Crowley, James L., “A Representation for Visual Information,” Ph.D. dissertation,
Robotics Institute, Carnegie-Mellon University, 1982.

(cited on p. 202)

[Crowley and Parker 1984]

Crowley, James L. and Alice C. Parker, “A Representation for Shape Based on
Peaks and Ridges in the Difference of Low Pass Transform,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 6, no. 2, March 1984, 156-170.

(cited on p. 202)

[Crowley and Stern 1984]

Crowley, James L. and Richard M. Stern, “Fast Computation of the Difference
of Low Pass Transform,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 6, no. 2, March 1984, 212-222.

(cited on p. 202)

[Davis 1973]

Davis, L., “A Survey of Edge Detection Techniques,” TR-273, Univ of Md,
Computer Science Center, 1973.

(cited on p. 24,33)

[DeBoor 1978]

DeBoor, C., A Practical Guide to Splines, Springer, 1978 (Vol. 27 in Applied
Mathematical Sciences series).

(cited on p. 54)

[Dineen 1955]

Dineen, G.P., “Programming Pattern Recognition,” Proc. WJCC, 94-100,




(cited on p. 4)

[Bruss and Horn 1983]

Bruss, Anna R. and Berthold K.P. Horn, “Passive Navigation,” Computer
Vision, Graphics, and Image Processing, 21, 1983, 3-20.

(cited on p. 206)

[Buxton and Buxton 1983]

Buxton, B.F. and Hilary Buxton, “Monocular Depth Perception from Optical
Flow by Space Time Signal Processing,” Proceedings of the Royal Society of
London B, vol. 218, 1983, 27-47.

(cited on p. 206)

[Canny 1983]

Canny, John Francis, “Finding Edges and Lines in Images,” MIT Master's thesis,
MIT AI-TR-720, June 1983.

(cited on p. 18,42,48,54,55,56,58,62,63,112)

[Chow and Kaneko 1972]

Chow, C.K. and T. Kaneko, “Boundary Detection of Radiographic Images by a
Threshold Method,” in Frontiers of Pattern Recognition, S. Watanabe, Ed.,
Academic Press, New York, 1972, 61-82.

(cited on p. 24)

[Chow and Hale 1982]

Chow, Shui-Nee and Jack K. Hale, Methods of Bifurcation Theory, vol.
251 of Grundlehren der mathematischen Wissenschaften, Springer-Verlag, New
York, 1982. [QA372.C544, ISBN 0-387-90664-9].

(cited on p. 199)




J. Phys. A: Math. Gen., vol. 10, no. 11, 1977, 1809-1821.

(cited on p. 199)

[Binford 1970]

Binford, Thomas O., “The TOPOLOGIST,” Internal Report MIT-AI, 1970.

(cited on p. 20)

[Binford 1981]

Binford, Thomas O., “Inferring Surfaces from Images,” Artificial Intelligence,
17, 1981, 205-244.

(cited on p. 18,102,113,114,122)

[Blicher and Omohundro 1984]

Blicher, A. Peter and Stephen M. Omohundro, “Convolution with Gaussians has
Generic Bifurcations,” In preparation.

(cited on p. 80,204)

[Brice and Fennema 1970]

Brice, C.R. and C.L. Fennema, “Scene Analysis Using Regions,” Artificial
Intelligence Group Technical Note 17, Stanford Research Institute, April 1970.

(cited on p. 21,85)

[Bröcker and Lander 1975]

Bröcker, T. and L. Lander, Differentiable Germs and Catastrophes,
London Mathematical Society Lecture Note Series 17, Cambridge University
Press, Cambridge, 1975. [ISBN 0-521-20081-2].

(cited on p. 182)

[Brooks 1981]

Brooks, Rodney A., “Symbolic Reasoning Among 3-D Models and 2-D Images,”
Artificial Intelligence, 17, 1981, 285-348.




[Ballard and Brown 1982]

Ballard, Dana Harry and Christopher M. Brown, Computer Vision, Prentice-Hall,
Englewood Cliffs, 1982. [TA1632.B34, ISBN 0-13-165316-4].

(cited on p. 212)

[Beaudet 1978]

Beaudet, P.R., “Rotationally Invariant Image Operators,” in Proceedings of
the Fourth International Joint Conference on Pattern Recognition (IJCPR-78),
(Kyoto, Japan, November 7-10, 1978), 579-583.

(cited on p. 18,19,44,47,48,49,50)

[Babaud, Witkin, Duda 1983]

Babaud, J., Andrew Witkin, and Richard Duda, “Uniqueness of the Gaussian
Kernel for Scale-space Filtering,” Fairchild Laboratory for Artificial Intelligence
Research, Fairchild TR 645, FLAIR Memo 22, 1983.

(cited on p. 203)

[Barnard and Fischler 1982]

Barnard, S., and M. A. Fischler, “Computational Stereo,” ACM Computing
Surveys, vol. 14, no. 4, December 1982.

(cited on p. 178)

[Berry 1977]

Berry, M.V., “Focusing and Twinkling: Critical Exponents from Catastrophes
in Non-Gaussian Random Short Waves,” J. Phys. A: Math. Gen., vol. 10, no.
12, 1977, 2061-2081.

(cited on p. 199)


[Berry and Hannay 1977]

Berry, M.V. and J.H. Hannay, “Umbilic Points on Gaussian Random Surfaces,”




[Arnold, V.I. 1983]

Arnold, Vladimir Igorevich, “Singularities of Systems of Rays,” Uspekhi Mat.
Nauk, 38:2, 1983, 77-147. Available in English translation as: Russian Math.
Surveys, 38:2, 1983, 87-176.

(cited on p. 148,223)

[Arnold 1984]

Arnold, Vladimir Igorevich, Catastrophe Theory, translation of Teoriia
Katastrof, Springer-Verlag, Berlin, Heidelberg, New York, Tokyo, 1984.
[QA614.58.A7613 1984, ISBN 0-387-12859-X (USA), ISBN 3-540-12859-X
(FRG)].

(cited on p. 148,199)

[Baker 1981]

Baker, H. Harlyn, “Depth from Edge and Intensity Based Stereo,” University
of Illinois, Ph.D. thesis, September 1981; also available as AIM-347, Stanford
Artificial Intelligence Laboratory, September 1982.

(cited on p. 154,178)

[Baker et al. 1983]

Baker, H. Harlyn, Thomas O. Binford, Jitendra Malik, and Jean-Frederic Meller,
“Progress in Stereo Mapping,” Proceedings of the DARPA Image Understanding
Workshop, Arlington, VA., June 1983, 327-335.

(cited on p. 155,178)

[Ballard and Sklansky 1976]

Ballard, Dana Harry and J. Sklansky, “A Ladder-structured Decision Tree for
Recognising Tumors in Chest Radiographs,” IEEE Transactions on Computers,
vol. C-25, 5, May 1976, 503-513.

(cited on p. 19,83)




[Abdou 1978]

Abdou, I.G., “Quantitative Methods of Edge Detection,” Ph.D. thesis, University
of Southern California, July 1978. Also USCIPI Report 830.

(cited on p. 14,17,31,33,39,41)

[Abraham and Marsden 1978]

Abraham, Ralph H. and Jerrold E. Marsden, Foundations of Mechanics,
2nd ed., Benjamin/Cummings, Reading, Mass., 1978. [QA805.A2 1977, ISBN
0-8053-0102-X].

(cited on p. 160,163,190,193,212)

[Abraham and Shaw 1981]

Abraham, Ralph H. and Christopher D. Shaw, Dynamics - the Geometry of
Behavior, vol. 1 of The Visual Mathematics Library, Aerial Press, Santa Cruz,
1981. [ISBN 0-942344-01-4].

(cited on p. 193)

[Altes 1975]

Altes, R.A., “Spline-like Image Analysis with a Complexity Constraint.
Similarities to Human Vision,” unpublished paper, ca. 1975, 36 pp.

(cited on p. 17,19,35)

[Arnold, R.D. 1983]

Arnold, R. David, “Automated Stereo Perception,” Department of Computer
Science, Stanford University, Ph.D. thesis, 1983.

(cited on p. 178)


Postscript 




It's customary to conclude a thesis with a compendium of “future research directions,”
the research that should have been, but wasn't, done for the present work, but will be
sometime soon. In adhering to this tradition, I present here a sketch of a program of
research that continues what was started here.

The imaging projection has regular points and singular points. Interesting edges occur
at the singular points (which are generally limbs), but our geometric analysis has been
confined mostly to the regular points, mainly because it's easier. We still had to consider
singular points, but they were of lower dimension. A large theory exists for singularities of
stable mappings; it is waiting to be applied. [Koenderink and van Doorn 1976, Koenderink
and van Doorn 1980, Koenderink and van Doorn 1982] have begun some of this work.
First you have to classify the singularities which can occur. There are only 2 singularities
for generic maps from the plane to the plane: the fold and the cusp. In a masterly work,
[Arnold, V.I. 1983] suggests, however, that the right setting is singularities of a projection
from a generic embedding. This is not quite the same as a generic map between surfaces,
and Arnold and his coworkers have found that there are exactly 14 types of singularities
in this setting.

We have stressed that picture data only reveals geometry via the measuring device of
the image intensity. It is therefore necessary to go beyond the projection singularities
themselves, and study how they may be inferred from the image function. This is the
generalization of the edge detection intuition: “look for discontinuities.”

To all this, we can add time. This leads to the study of unfoldings of singularities, and
again the time-varying image intensity is the telltale, and the Lie group and Lie algebra
of the motion will be the instruments of analysis.

We haven't finished with the regular points, either. The Lie algebra analysis we began
can be extended to analyzing the problems of finding the shape and motion of moving
objects. The questions of what information is necessary should be resolvable.

We have thus far mainly ignored the problems of photometry. [Koenderink and van Doorn
1980] have pioneered in applying geometric methods here. The Lie algebra approach can
be extended to include photometry by considering not just the object in space, but a
double sphere bundle over it, describing the directions of light and observer. Part of
this is already implicit in the Gaussian sphere approach, for example. There are many
interesting results that may be of use. Lines of principal curvature seem to be important,
but it is only recently that the topology of the lines of principal curvature, i.e. how they
fill out the surface, has been thoroughly understood [Sotomayor 1984].

All this geometry must be brought to bear to get local and global understanding of the
image intensity function, the right type of understanding to make deductions about the
physical situation that produced it. We have been arguing that an important element is
qualitative, i.e. geometric understanding, rather than quantitative. A picture is not a C^∞
function, so there is a problem of how to derive this information. The data results from
a map from an infinite-dimensional space of smooth functions to a finite-dimensional one
of values on a grid, and indeed to a finite set of digitized values. The relation of these
maps to the smooth theory has to be looked at carefully. Probably the most direct way
to apply theory for smooth functions to this data is to choose some smooth function to
represent it, i.e. fit the data. How to do the fit? There are many choices: polynomials,
Fourier interpolants, spheroidal harmonics, etc. The mathematics of fitting is partially
independent of what is being fit, so it should be possible to obtain a theory without
making a choice of basis at the outset. The same philosophy should be transferable to
implementation: the program could be designed to take the basis as data.
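
The suggestion that a fitting program could take its basis as data can be sketched as
follows. This is a minimal illustration, not an implementation from the thesis. The function
names are hypothetical, the fit is one-dimensional, and the normal equations are used only
for brevity (an orthogonal factorization would usually be better conditioned).

```python
import math


def fit(basis, xs, ys, eps=1e-12):
    """Least-squares coefficients c minimizing
    sum_k (sum_i c_i * basis[i](x_k) - y_k)^2.

    basis: list of callables, supplied as data rather than hard-coded.
    Solves the normal equations (A^T A) c = A^T y by Gauss-Jordan elimination.
    """
    design = [[phi(x) for phi in basis] for x in xs]
    n = len(basis)
    # Augmented normal equations [A^T A | A^T y].
    aug = [[sum(row[i] * row[j] for row in design) for j in range(n)]
           + [sum(row[i] * y for row, y in zip(design, ys))]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < eps:
            raise ValueError("basis is linearly dependent on the sample points")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                for j in range(col, n + 1):
                    aug[i][j] -= factor * aug[col][j]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def evaluate(basis, coeffs, x):
    """Value of the fitted function at x."""
    return sum(c * phi(x) for c, phi in zip(coeffs, basis))


# Two interchangeable bases passed to the same fitting routine.
polynomial_basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
fourier_basis = [lambda x: 1.0, math.cos, math.sin]

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0 + 2.0 * x for x in xs]  # samples of a known function
poly_coeffs = fit(polynomial_basis, xs, ys)
fourier_coeffs = fit(fourier_basis, xs, ys)
```

The fitting routine never inspects what the basis functions are, which is the sense in which
the choice of basis is deferred to the caller.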
What we saw earlier is

Lemma (3 fiber lemma). If we choose 3 generic points in R^2, and 2 linearly independent
covectors in each fiber over those points, the 6 resulting points of T*R^2 are mapped to a
spanning set in e*(3).

T*R^2 is 4-dimensional, so it is a little hard to draw. We represent the situation
schematically in Fig. (3 fibers).


Fig. (3 fibers)


What we will now show is that we can choose any 6 generic points in T*R^2, i.e. 6 generic
points in the image, and 6 generic values of df at those points (i.e. a generic f). This
is pretty easy by making use of the 3 fiber lemma. The lemma still applies for any
neighborhood in R^2, i.e. we can choose the 3 points arbitrarily close together. This gives
us

Lemma (local spanning). Every neighborhood of every point in T*R^2 contains 6 points
which are mapped to a spanning set in e*(3).

Proof. Choose a point and neighborhood in T*R^2. It projects to a neighborhood in R^2,
in which we can choose 3 generic points. We can then choose 6 points in T*R^2, 2 to a
fiber, by the 3 fiber lemma. QED (local spanning).

Now we can see what happens when we choose 6 points in the image. df gives us 6
points in T*R^2. We can perturb these points to guarantee that df ≠ 0. Now since every
neighborhood of each point maps to a spanning set of e*(3) (local spanning lemma), we
can always perturb the nth point so that it is mapped to something outside the span of
the first n − 1 points (at least through n = 6, anyway). This gives a perturbation of the 6
points which maps to a spanning set. Since spanning sets are open, these points will still
span under sufficiently small perturbation. (In general, one might need a perturbation
of both the location of the points and of f to guarantee a spanning set. The degenerate
situation occurs when the optic velocity vector is in the direction of constant f.) QED

Here is a more concrete way of looking at the last part of the proof. We already saw that
if we had 2 generic functions then we would be finished with 3 generic points. This is the
situation of Fig. (3 fibers). It can be pictured in the image as in Fig. (vector points).


Each vector represents a direction in which the optic velocity vector can be measured,
i.e. a value of ∇f for one of the functions. We want to get rid of one of these at each
point, and substitute measurements at 3 new points that have been given to us.

If we remove a vector from one of the 3 original points (i.e. remove a point from the set in
T*R^2), this leaves us with a 1-dimensional kernel in e(3). If we go to one of the new points,
the spanning lemma tells us we can again measure the kernel, perhaps after an arbitrarily
small perturbation. This can be repeated, and 2 more measurements moved the same
way, to get 6 vectors at 6 points, corresponding to ∇f, perhaps slightly perturbed.